{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![MLU Logo](../data/MLU_Logo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a name=\"0\">Machine Learning Accelerator - Natural Language Processing - Lecture 2</a>\n",
    "\n",
    "## Linear Regression Models and Regularization\n",
    "\n",
    "In this notebook, we go over Linear Regression methods (with and without regularization: LinearRegression, Ridge, Lasso, ElasticNet) to predict the __log_votes__ field of our review dataset. \n",
    "\n",
    "1. <a href=\"#1\">Reading the dataset</a>\n",
    "2. <a href=\"#2\">Exploratory data analysis</a>\n",
    "3. <a href=\"#3\">Stop word removal and stemming</a>\n",
    "4. <a href=\"#4\">Train - Validation Split</a>\n",
    "5. <a href=\"#5\">Data processing with Pipeline and ColumnTransform</a>\n",
    "6. <a href=\"#6\">Train the regressor</a>\n",
    "7. <a href=\"#7\">Fitting Linear Regression models and checking the validation performance</a> Find more details on the classical Linear Regression models with and without regularization here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model\n",
    "8. <a href=\"#8\">Ideas for improvement</a>\n",
    "\n",
    "Overall dataset schema:\n",
    "* __reviewText:__ Text of the review\n",
    "* __summary:__ Summary of the review\n",
    "* __verified:__ Whether the purchase was verified (True or False)\n",
    "* __time:__ UNIX timestamp for the review\n",
    "* __rating:__ Rating of the review\n",
    "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. <a name=\"1\">Reading the dataset</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We will use the __pandas__ library to read our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewText</th>\n",
       "      <th>summary</th>\n",
       "      <th>verified</th>\n",
       "      <th>time</th>\n",
       "      <th>rating</th>\n",
       "      <th>log_votes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Stuck with this at work, slow and we still got...</td>\n",
       "      <td>Use SEP or Mcafee</td>\n",
       "      <td>False</td>\n",
       "      <td>1464739200</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>I use parallels every day with both my persona...</td>\n",
       "      <td>Use it daily</td>\n",
       "      <td>False</td>\n",
       "      <td>1332892800</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Barbara Robbins\\n\\nI've used TurboTax to do ou...</td>\n",
       "      <td>Helpful Product</td>\n",
       "      <td>True</td>\n",
       "      <td>1398816000</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>I have been using this software security for y...</td>\n",
       "      <td>Five Stars</td>\n",
       "      <td>True</td>\n",
       "      <td>1430784000</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>If you want your computer hijacked and slowed ...</td>\n",
       "      <td>... hijacked and slowed to a crawl Windows 10 ...</td>\n",
       "      <td>False</td>\n",
       "      <td>1508025600</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          reviewText  \\\n",
       "0  Stuck with this at work, slow and we still got...   \n",
       "1  I use parallels every day with both my persona...   \n",
       "2  Barbara Robbins\\n\\nI've used TurboTax to do ou...   \n",
       "3  I have been using this software security for y...   \n",
       "4  If you want your computer hijacked and slowed ...   \n",
       "\n",
       "                                             summary  verified        time  \\\n",
       "0                                  Use SEP or Mcafee     False  1464739200   \n",
       "1                                       Use it daily     False  1332892800   \n",
       "2                                    Helpful Product      True  1398816000   \n",
       "3                                         Five Stars      True  1430784000   \n",
       "4  ... hijacked and slowed to a crawl Windows 10 ...     False  1508025600   \n",
       "\n",
       "   rating  log_votes  \n",
       "0     1.0        0.0  \n",
       "1     5.0        0.0  \n",
       "2     4.0        0.0  \n",
       "3     5.0        0.0  \n",
       "4     1.0        0.0  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-REGRESSION.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first five rows in the dataset. As you can see the __log_votes__ field is numeric. That's why we will build a regression model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. <a name=\"2\">Exploratory data analysis</a>\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the range and distribution of log_votes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"log_votes\"].min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.799753318287247"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"log_votes\"].max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAk0AAAGdCAYAAAAPLEfqAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8fJSN1AAAACXBIWXMAAA9hAAAPYQGoP6dpAAA7B0lEQVR4nO3df3BU9b3/8dcayBrSZJsQ82MvAWkNkRjwtqGGgLeAQALmh0qnYFNXKDTqBQkpyVWx35lirxIEhXpvrojWCwrY2BaxOkCaKIJNIfyIphJEpBUhaEJQlg1JcRPD+f7h5YxL/HGIG3cTn4+ZM5NzznvPvj+rM/vis589azMMwxAAAAC+0CWBbgAAAKA3IDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFvQLdAN9yblz5/T+++8rIiJCNpst0O0AAAALDMPQmTNn5HQ6dcklnz+fRGjyo/fff1+JiYmBbgMAAHRDQ0ODBg0a9LnnCU1+FBERIemTFz0yMjLA3QAAACtaWlqUmJhovo9/HkKTH53/SC4yMpLQBABAL/NlS2tYCA4AAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwIGhCU2lpqWw2m4qKisxjhmFo8eLFcjqdCgsL0/jx43XgwAGfx3m9Xs2fP18xMTEKDw9XXl6ejh8/7lPjdrvlcrnkcDjkcDjkcrl0+vRpn5pjx44pNzdX4eHhiomJUWFhodrb23tquAAAoJcJitC0d+9ePf744xo5cqTP8WXLlmnFihUqKyvT3r17FR8fr8mTJ+vMmTNmTVFRkTZt2qTy8nJVV1ertbVVOTk56uzsNGvy8/NVV1eniooKVVRUqK6uTi6Xyzzf2dmp7OxstbW1qbq6WuXl5dq4caOKi4t7fvAAAKB3MALszJkzRlJSklFVVWWMGzfOWLBggWEYhnHu3DkjPj7eWLp0qVn70UcfGQ6Hw3jssccMwzCM06dPG/379zfKy8vNmvfee8+45JJLjIqKCsMwDOPNN980JBk1NTVmza5duwxJxltvvWUYhmFs2bLFuOSSS4z33nvPrPnd735n2O12w+PxWB6Lx+MxJF3UYwAAQGBZff8O+EzTvHnzlJ2drUmTJvkcP3LkiJqampSZmWkes9vtGjdunHbu3ClJqq2tVUdHh0+N0+lUamqqWbNr1y45HA6lp6ebNaNHj5bD4fCpSU1NldPpNGuysrLk9XpVW1v7ub17vV61tLT4bAAAoG8K6A/2lpeX67XXXtPevXu7nGtqapIkxcXF+RyPi4vT0aNHzZrQ0FBFRUV1qTn/+KamJsXGxna5fmxsrE/Nhc8TFRWl0NBQs+azlJaW6r777vuyYQIAgD4gYDNNDQ0NWrBggdavX69LL730c+su/MVhwzC+9FeIL6z5rPru1Fxo0aJF8ng85tbQ0PCFfQEAgN4rYDNNtbW1am5uVlpamnmss7NTr776qsrKynTo0CFJn8wCJSQkmDXNzc3mrFB8fLza29vldrt9Zpuam5s1ZswYs+bEiRNdnv/kyZM+19m9e7fPebfbrY6Oji4zUJ9mt9tlt9svdujdcvk9m7+W5/Gnd5dmB7oFAAD8JmAzTRMnTtT+/ftVV1dnbqNGjdJPf/pT1dXV6Tvf+Y7i4+NVVVVlPqa9vV07duwwA1FaWpr69+/vU9PY2Kj6+nqzJiMjQx6PR3v27DFrdu/eLY/H41NTX1+vxsZGs6ayslJ2u90n1AEAgG+ugM00RUREKDU11edYeHi4Bg4caB4vKirSkiVLlJSUpKSkJC1ZskQDBgxQfn6+JMnhcGjOnDkqLi7WwIEDFR0drZKSEo0YMcJcWD58+HBNmTJFBQUFWr16tSTptttuU05OjpKTkyVJmZmZSklJkcvl0vLly3Xq1CmVlJSooKBAkZGRX9dLAgAAglhAF4J/mbvu
uktnz57V3Llz5Xa7lZ6ersrKSkVERJg1K1euVL9+/TR9+nSdPXtWEydO1Nq1axUSEmLWbNiwQYWFhea37PLy8lRWVmaeDwkJ0ebNmzV37lyNHTtWYWFhys/P10MPPfT1DRYAAAQ1m2EYRqCb6CtaWlrkcDjk8Xj8PkPFmiYAAHqG1ffvgN+nCQAAoDcgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsCGhoWrVqlUaOHKnIyEhFRkYqIyNDW7duNc/PmjVLNpvNZxs9erTPNbxer+bPn6+YmBiFh4crLy9Px48f96lxu91yuVxyOBxyOBxyuVw6ffq0T82xY8eUm5ur8PBwxcTEqLCwUO3t7T02dgAA0LsENDQNGjRIS5cu1b59+7Rv3z5dd911uuGGG3TgwAGzZsqUKWpsbDS3LVu2+FyjqKhImzZtUnl5uaqrq9Xa2qqcnBx1dnaaNfn5+aqrq1NFRYUqKipUV1cnl8tlnu/s7FR2drba2tpUXV2t8vJybdy4UcXFxT3/IgAAgF7BZhiGEegmPi06OlrLly/XnDlzNGvWLJ0+fVrPP//8Z9Z6PB5ddtllWrdunWbMmCFJev/995WYmKgtW7YoKytLBw8eVEpKimpqapSeni5JqqmpUUZGht566y0lJydr69atysnJUUNDg5xOpySpvLxcs2bNUnNzsyIjIy313tLSIofDIY/HY/kxVl1+z2a/Xu/r8O7S7EC3AADAl7L6/h00a5o6OztVXl6utrY2ZWRkmMe3b9+u2NhYDRs2TAUFBWpubjbP1dbWqqOjQ5mZmeYxp9Op1NRU7dy5U5K0a9cuORwOMzBJ0ujRo+VwOHxqUlNTzcAkSVlZWfJ6vaqtrf3cnr1er1paWnw2AADQNwU8NO3fv1/f+ta3ZLfbdccdd2jTpk1KSUmRJE2dOlUbNmzQtm3b9PDDD2vv3r267rrr5PV6JUlNTU0KDQ1VVFSUzzXj4uLU1NRk1sTGxnZ53tjYWJ+auLg4n/NRUVEKDQ01az5LaWmpuU7K4XAoMTGx+y8EAAAIav0C3UBycrLq6up0+vRpbdy4UTNnztSOHTuUkpJifuQmSampqRo1apSGDBmizZs3a9q0aZ97TcMwZLPZzP1P//1Vai60aNEiLVy40NxvaWkhOAEA0EcFfKYpNDRUV1xxhUaNGqXS0lJdffXVeuSRRz6zNiEhQUOGDNHhw4clSfHx8Wpvb5fb7fapa25uNmeO4uPjdeLEiS7XOnnypE/NhTNKbrdbHR0dXWagPs1ut5vf/Du/AQCAvingoelChmGYH79d6MMPP1RDQ4MSEhIkSWlpaerfv7+qqqrMmsbGRtXX12vMmDGSpIyMDHk8Hu3Zs8es2b17tzwej09NfX29GhsbzZrKykrZ7XalpaX5fYwAAKD3CejHc/fee6+mTp2qxMREnTlzRuXl5dq+fbsqKirU2tqqxYsX60c/+pESEhL07rvv6t5771VMTIxuuukmSZLD4dCcOXNUXFysgQMHKjo6WiUlJRoxYoQmTZokSRo+fLimTJmigoICrV69WpJ02223KScnR8nJyZKkzMxMpaSkyOVyafny5Tp16pRKSkpUUFDA7BEAAJAU4NB04sQJuVwuNTY2yuFwaOTIkaqoqNDkyZN19uxZ7d+/X08//bROnz6thIQETZgwQc8++6wiIiLMa6xcuVL9+vXT9OnTdfbsWU2cOFFr165VSEiIWbNhwwYVFhaa37LLy8tTWVmZeT4kJESbN2/W3LlzNXbs
WIWFhSk/P18PPfTQ1/diAACAoBZ092nqzbhPky/u0wQA6A163X2aAAAAghmhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGBBQEPTqlWrNHLkSEVGRioyMlIZGRnaunWred4wDC1evFhOp1NhYWEaP368Dhw44HMNr9er+fPnKyYmRuHh4crLy9Px48d9atxut1wulxwOhxwOh1wul06fPu1Tc+zYMeXm5io8PFwxMTEqLCxUe3t7j40dAAD0LgENTYMGDdLSpUu1b98+7du3T9ddd51uuOEGMxgtW7ZMK1asUFlZmfbu3av4+HhNnjxZZ86cMa9RVFSkTZs2qby8XNXV1WptbVVOTo46OzvNmvz8fNXV1amiokIVFRWqq6uTy+Uyz3d2dio7O1ttbW2qrq5WeXm5Nm7cqOLi4q/vxQAAAEHNZhiGEegmPi06OlrLly/X7Nmz5XQ6VVRUpLvvvlvSJ7NKcXFxevDBB3X77bfL4/Hosssu07p16zRjxgxJ0vvvv6/ExERt2bJFWVlZOnjwoFJSUlRTU6P09HRJUk1NjTIyMvTWW28pOTlZW7duVU5OjhoaGuR0OiVJ5eXlmjVrlpqbmxUZGWmp95aWFjkcDnk8HsuPseryezb79Xpfh3eXZge6BQAAvpTV9++gWdPU2dmp8vJytbW1KSMjQ0eOHFFTU5MyMzPNGrvdrnHjxmnnzp2SpNraWnV0dPjUOJ1OpaammjW7du2Sw+EwA5MkjR49Wg6Hw6cmNTXVDEySlJWVJa/Xq9ra2h4dNwAA6B36BbqB/fv3KyMjQx999JG+9a1vadOmTUpJSTEDTVxcnE99XFycjh49KklqampSaGiooqKiutQ0NTWZNbGxsV2eNzY21qfmwueJiopSaGioWfNZvF6vvF6vud/S0mJ12AAAoJcJ+ExTcnKy6urqVFNTo3//93/XzJkz9eabb5rnbTabT71hGF2OXejCms+q707NhUpLS83F5Q6HQ4mJiV/YFwAA6L0CHppCQ0N1xRVXaNSoUSotLdXVV1+tRx55RPHx8ZLUZaanubnZnBWKj49Xe3u73G73F9acOHGiy/OePHnSp+bC53G73ero6OgyA/VpixYtksfjMbeGhoaLHD0AAOgtAh6aLmQYhrxer4YOHar4+HhVVVWZ59rb27Vjxw6NGTNGkpSWlqb+/fv71DQ2Nqq+vt6sycjIkMfj0Z49e8ya3bt3y+Px+NTU19ersbHRrKmsrJTdbldaWtrn9mq3283bJZzfAABA3xTQNU333nuvpk6dqsTERJ05c0bl5eXavn27KioqZLPZVFRUpCVLligpKUlJSUlasmSJBgwYoPz8fEmSw+HQnDlzVFxcrIEDByo6OlolJSUaMWKEJk2aJEkaPny4pkyZooKCAq1evVqSdNtttyknJ0fJycmSpMzMTKWkpMjlcmn58uU6deqUSkpKVFBQQBACAACSAhyaTpw4IZfLpcbGRjkcDo0cOVIVFRWaPHmyJOmuu+7S2bNnNXfuXLndbqWnp6uyslIRERHmNVauXKl+/fpp+vTpOnv2rCZOnKi1a9cqJCTErNmwYYMKCwvNb9nl5eWprKzMPB8SEqLNmzdr7ty5Gjt2rMLCwpSfn6+HHnroa3olAABAsAu6+zT1ZtynyRf3aQIA9Aa97j5NAAAAwYzQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAAL
CE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGBBQENTaWmpfvCDHygiIkKxsbG68cYbdejQIZ+aWbNmyWaz+WyjR4/2qfF6vZo/f75iYmIUHh6uvLw8HT9+3KfG7XbL5XLJ4XDI4XDI5XLp9OnTPjXHjh1Tbm6uwsPDFRMTo8LCQrW3t/fI2AEAQO8S0NC0Y8cOzZs3TzU1NaqqqtLHH3+szMxMtbW1+dRNmTJFjY2N5rZlyxaf80VFRdq0aZPKy8tVXV2t1tZW5eTkqLOz06zJz89XXV2dKioqVFFRobq6OrlcLvN8Z2ensrOz1dbWpurqapWXl2vjxo0qLi7u2RcBAAD0Cv0C+eQVFRU++2vWrFFsbKxqa2v1wx/+0Dxut9sVHx//mdfweDx68skntW7dOk2aNEmStH79eiUmJuqll15SVlaWDh48qIqKCtXU1Cg9PV2S9MQTTygjI0OHDh1ScnKyKisr9eabb6qhoUFOp1OS9PDDD2vWrFl64IEHFBkZ2RMvAQAA6CWCak2Tx+ORJEVHR/sc3759u2JjYzVs2DAVFBSoubnZPFdbW6uOjg5lZmaax5xOp1JTU7Vz505J0q5du+RwOMzAJEmjR4+Ww+HwqUlNTTUDkyRlZWXJ6/Wqtrb2M/v1er1qaWnx2QAAQN8UNKHJMAwtXLhQ1157rVJTU83jU6dO1YYNG7Rt2zY9/PDD2rt3r6677jp5vV5JUlNTk0JDQxUVFeVzvbi4ODU1NZk1sbGxXZ4zNjbWpyYuLs7nfFRUlEJDQ82aC5WWlpprpBwOhxITE7v/AgAAgKAW0I/nPu3OO+/UG2+8oerqap/jM2bMMP9OTU3VqFGjNGTIEG3evFnTpk373OsZhiGbzWbuf/rvr1LzaYsWLdLChQvN/ZaWFoITAAB9VFDMNM2fP18vvPCCXnnlFQ0aNOgLaxMSEjRkyBAdPnxYkhQfH6/29na53W6fuubmZnPmKD4+XidOnOhyrZMnT/rUXDij5Ha71dHR0WUG6jy73a7IyEifDQAA9E0BDU2GYejOO+/Uc889p23btmno0KFf+pgPP/xQDQ0NSkhIkCSlpaWpf//+qqqqMmsaGxtVX1+vMWPGSJIyMjLk8Xi0Z88es2b37t3yeDw+NfX19WpsbDRrKisrZbfblZaW5pfxAgCA3qtboenIkSN+efJ58+Zp/fr1euaZZxQREaGmpiY1NTXp7NmzkqTW1laVlJRo165devfdd7V9+3bl5uYqJiZGN910kyTJ4XBozpw5Ki4u1ssvv6zXX39dt9xyi0aMGGF+m2748OGaMmWKCgoKVFNTo5qaGhUUFCgnJ0fJycmSpMzMTKWkpMjlcun111/Xyy+/rJKSEhUUFDCDBAAAuhearrjiCk2YMEHr16/XRx991O0nX7VqlTwej8aPH6+EhARze/bZZyVJISEh2r9/v2644QYNGzZMM2fO1LBhw7Rr1y5FRESY11m5cqVuvPFGTZ8+XWPHjtWAAQP04osvKiQkxKzZsGGDRowYoczMTGVmZmrkyJFat26deT4kJESbN2/WpZdeqrFjx2r69Om68cYb9dBDD3V7fAAAoO+wGYZhXOyD6uvr9b//+7/asGGDvF6vZsyYoTlz5uiaa67piR57jZaWFjkcDnk8Hr/PTl1+z2a/Xu/r8O7S7EC3AADAl7L6/t2tmabU1FStWLFC7733ntasWaOmpiZde+21uuqqq7RixQqdPHmy240DAAAEo6+0ELxfv3666aab9Pvf/14PPvig/vGPf6ikpESDBg3Srbfe6rOoGgAAoDf7SqFp3759mjt3rhISErRixQqVlJToH//4h7Zt26b33ntPN9xwg7/6BAAACKhu
3dxyxYoVWrNmjQ4dOqTrr79eTz/9tK6//npdcsknGWzo0KFavXq1rrzySr82CwAAECjdCk2rVq3S7Nmz9bOf/exzf0h38ODBevLJJ79ScwAAAMGiW6Hp/N24v0hoaKhmzpzZncsDAAAEnW6taVqzZo3+8Ic/dDn+hz/8QU899dRXbgoAACDYdCs0LV26VDExMV2Ox8bGasmSJV+5KQAAgGDTrdB09OjRz/yduCFDhujYsWNfuSkAAIBg063QFBsbqzfeeKPL8b/97W8aOHDgV24KAAAg2HQrNN18880qLCzUK6+8os7OTnV2dmrbtm1asGCBbr75Zn/3CAAAEHDd+vbc/fffr6NHj2rixInq1++TS5w7d0633nora5oAAECf1K3QFBoaqmeffVb/+Z//qb/97W8KCwvTiBEjNGTIEH/3BwAAEBS6FZrOGzZsmIYNG+avXgAAAIJWt0JTZ2en1q5dq5dfflnNzc06d+6cz/lt27b5pTkAAIBg0a3QtGDBAq1du1bZ2dlKTU2VzWbzd18AAABBpVuhqby8XL///e91/fXX+7sfAACAoNStWw6Ehobqiiuu8HcvAAAAQatboam4uFiPPPKIDMPwdz8AAABBqVsfz1VXV+uVV17R1q1bddVVV6l///4+55977jm/NAcAABAsuhWavv3tb+umm27ydy8AAABBq1uhac2aNf7uAwAAIKh1a02TJH388cd66aWXtHr1ap05c0aS9P7776u1tdVvzQEAAASLbs00HT16VFOmTNGxY8fk9Xo1efJkRUREaNmyZfroo4/02GOP+btPAACAgOrWTNOCBQs0atQoud1uhYWFmcdvuukmvfzyy35rDgAAIFh0+9tzf/3rXxUaGupzfMiQIXrvvff80hgAAEAw6dZM07lz59TZ2dnl+PHjxxUREfGVmwIAAAg23QpNkydP1m9+8xtz32azqbW1Vb/61a/4aRUAANAndevjuZUrV2rChAlKSUnRRx99pPz8fB0+fFgxMTH63e9+5+8eAQAAAq5bocnpdKqurk6/+93v9Nprr+ncuXOaM2eOfvrTn/osDAcAAOgruhWaJCksLEyzZ8/W7Nmz/dkPAABAUOpWaHr66ae/8Pytt97arWYAAACCVbdC04IFC3z2Ozo69M9//lOhoaEaMGAAoQkAAPQ53fr2nNvt9tlaW1t16NAhXXvttRe1ELy0tFQ/+MEPFBERodjYWN144406dOiQT41hGFq8eLGcTqfCwsI0fvx4HThwwKfG6/Vq/vz5iomJUXh4uPLy8nT8+PEuPbtcLjkcDjkcDrlcLp0+fdqn5tixY8rNzVV4eLhiYmJUWFio9vb2i3txAABAn9Tt3567UFJSkpYuXdplFuqL7NixQ/PmzVNNTY2qqqr08ccfKzMzU21tbWbNsmXLtGLFCpWVlWnv3r2Kj4/X5MmTzd+7k6SioiJt2rRJ5eXlqq6uVmtrq3JycnzuJZWfn6+6ujpVVFSooqJCdXV1crlc5vnOzk5lZ2erra1N1dXVKi8v18aNG1VcXPwVXxkAANAX2AzDMPx1sddff13jxo1TS0tLtx5/8uRJxcbGaseOHfrhD38owzDkdDpVVFSku+++W9Ins0pxcXF68MEHdfvtt8vj8eiyyy7TunXrNGPGDEmf/HBwYmKitmzZoqysLB08eFApKSmqqalRenq6JKmmpkYZGRl66623lJycrK1btyonJ0cNDQ1yOp2SpPLycs2aNUvNzc2KjIz80v5bWlrkcDjk8Xgs1V+My+/Z7NfrfR3eXZod6BYAAPhSVt+/u7Wm6YUXXvDZNwxDjY2NKisr09ixY7tzSUmSx+ORJEVHR0uSjhw5oqamJmVmZpo1drtd48aN086dO3X77bertrZWHR0dPjVOp1OpqanauXOnsrKytGvXLjkcDjMwSdLo0aPlcDi0c+dOJScna9euXUpNTTUDkyRlZWXJ6/WqtrZWEyZM6NKv1+uV1+s197sbFgEA
QPDrVmi68cYbffZtNpsuu+wyXXfddXr44Ye71YhhGFq4cKGuvfZapaamSpKampokSXFxcT61cXFxOnr0qFkTGhqqqKioLjXnH9/U1KTY2NguzxkbG+tTc+HzREVFKTQ01Ky5UGlpqe67776LHSoAAOiFuhWazp075+8+dOedd+qNN95QdXV1l3M2m81n3zCMLscudGHNZ9V3p+bTFi1apIULF5r7LS0tSkxM/MK+AABA7+S3heBfxfz58/XCCy/olVde0aBBg8zj8fHxktRlpqe5udmcFYqPj1d7e7vcbvcX1pw4caLL8548edKn5sLncbvd6ujo6DIDdZ7dbldkZKTPBgAA+qZuzTR9enbly6xYseJzzxmGofnz52vTpk3avn27hg4d6nN+6NChio+PV1VVlb73ve9Jktrb27Vjxw49+OCDkqS0tDT1799fVVVVmj59uiSpsbFR9fX1WrZsmSQpIyNDHo9He/bs0TXXXCNJ2r17tzwej8aMGWPWPPDAA2psbFRCQoIkqbKyUna7XWlpaZbHCwAA+qZuhabXX39dr732mj7++GMlJydLkt5++22FhITo+9//vln3ZR+hzZs3T88884z+9Kc/KSIiwpzpcTgcCgsLk81mU1FRkZYsWaKkpCQlJSVpyZIlGjBggPLz883aOXPmqLi4WAMHDlR0dLRKSko0YsQITZo0SZI0fPhwTZkyRQUFBVq9erUk6bbbblNOTo7Zf2ZmplJSUuRyubR8+XKdOnVKJSUlKigoYAYJAAB0LzTl5uYqIiJCTz31lLkA2+1262c/+5n+7d/+zfK9jVatWiVJGj9+vM/xNWvWaNasWZKku+66S2fPntXcuXPldruVnp6uyspKRUREmPUrV65Uv379NH36dJ09e1YTJ07U2rVrFRISYtZs2LBBhYWF5rfs8vLyVFZWZp4PCQnR5s2bNXfuXI0dO1ZhYWHKz8/XQw89dNGvDwAA6Hu6dZ+mf/mXf1FlZaWuuuoqn+P19fXKzMzU+++/77cGexPu0+SL+zQBAHoDq+/f3VoI3tLS8pkLq5ubm33u1A0AANBXdCs03XTTTfrZz36mP/7xjzp+/LiOHz+uP/7xj5ozZ46mTZvm7x4BAAACrltrmh577DGVlJTolltuUUdHxycX6tdPc+bM0fLly/3aIAAAQDDoVmgaMGCAHn30US1fvlz/+Mc/ZBiGrrjiCoWHh/u7PwAAgKDwlW5u2djYqMbGRg0bNkzh4eHy42//AgAABJVuhaYPP/xQEydO1LBhw3T99dersbFRkvTzn//c8u0GAAAAepNuhaZf/OIX6t+/v44dO6YBAwaYx2fMmKGKigq/NQcAABAsurWmqbKyUn/+8599fidOkpKSknT06FG/NAYAABBMujXT1NbW5jPDdN4HH3wgu93+lZsCAAAINt0KTT/84Q/19NNPm/s2m03nzp3T8uXLNWHCBL81BwAAECy69fHc8uXLNX78eO3bt0/t7e266667dODAAZ06dUp//etf/d0jAABAwHVrpiklJUVvvPGGrrnmGk2ePFltbW2aNm2aXn/9dX33u9/1d48AAAABd9EzTR0dHcrMzNTq1at133339URPAAAAQeeiZ5r69++v+vp62Wy2nugHAAAgKHXr47lbb71VTz75pL97AQAACFrdWgje3t6u3/72t6qqqtKoUaO6/ObcihUr/NIcAABAsLio0PTOO+/o8ssvV319vb7//e9Lkt5++22fGj62AwAAfdFFhaakpCQ1NjbqlVdekfTJz6b813/9l+Li4nqkOQAAgGBxUWuaDMPw2d+6dava2tr82hAAAEAw6tZC8PMuDFEAAAB91UWFJpvN1mXNEmuYAADAN8FFrWkyDEOzZs0yf5T3o48+0h133NHl23PPPfec/zoEAAAIAhcVmmbOnOmzf8stt/i1GQAAgGB1UaFpzZo1PdUHAABAUPtKC8EBAAC+KQhNAAAAFhCaAAAALCA0AQAAWEBo
AgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsCGppeffVV5ebmyul0ymaz6fnnn/c5P2vWLPNHgs9vo0eP9qnxer2aP3++YmJiFB4erry8PB0/ftynxu12y+VyyeFwyOFwyOVy6fTp0z41x44dU25ursLDwxUTE6PCwkK1t7f3xLABAEAvFNDQ1NbWpquvvlplZWWfWzNlyhQ1Njaa25YtW3zOFxUVadOmTSovL1d1dbVaW1uVk5Ojzs5OsyY/P191dXWqqKhQRUWF6urq5HK5zPOdnZ3Kzs5WW1ubqqurVV5ero0bN6q4uNj/gwYAAL3SRf32nL9NnTpVU6dO/cIau92u+Pj4zzzn8Xj05JNPat26dZo0aZIkaf369UpMTNRLL72krKwsHTx4UBUVFaqpqVF6erok6YknnlBGRoYOHTqk5ORkVVZW6s0331RDQ4OcTqck6eGHH9asWbP0wAMPKDIy0o+jBgAAvVHQr2navn27YmNjNWzYMBUUFKi5udk8V1tbq46ODmVmZprHnE6nUlNTtXPnTknSrl275HA4zMAkSaNHj5bD4fCpSU1NNQOTJGVlZcnr9aq2tvZze/N6vWppafHZAABA3xTUoWnq1KnasGGDtm3bpocfflh79+7VddddJ6/XK0lqampSaGiooqKifB4XFxenpqYmsyY2NrbLtWNjY31q4uLifM5HRUUpNDTUrPkspaWl5joph8OhxMTErzReAAAQvAL68dyXmTFjhvl3amqqRo0apSFDhmjz5s2aNm3a5z7OMAzZbDZz/9N/f5WaCy1atEgLFy4091taWghOAAD0UUE903ShhIQEDRkyRIcPH5YkxcfHq729XW6326euubnZnDmKj4/XiRMnulzr5MmTPjUXzii53W51dHR0mYH6NLvdrsjISJ8NAAD0Tb0qNH344YdqaGhQQkKCJCktLU39+/dXVVWVWdPY2Kj6+nqNGTNGkpSRkSGPx6M9e/aYNbt375bH4/Gpqa+vV2Njo1lTWVkpu92utLS0r2NoAAAgyAX047nW1lb9/e9/N/ePHDmiuro6RUdHKzo6WosXL9aPfvQjJSQk6N1339W9996rmJgY3XTTTZIkh8OhOXPmqLi4WAMHDlR0dLRKSko0YsQI89t0w4cP15QpU1RQUKDVq1dLkm677Tbl5OQoOTlZkpSZmamUlBS5XC4tX75cp06dUklJiQoKCpg9AgAAkgIcmvbt26cJEyaY++fXB82cOVOrVq3S/v379fTTT+v06dNKSEjQhAkT9OyzzyoiIsJ8zMqVK9WvXz9Nnz5dZ8+e1cSJE7V27VqFhISYNRs2bFBhYaH5Lbu8vDyfe0OFhIRo8+bNmjt3rsaOHauwsDDl5+froYce6umXAAAA9BI2wzCMQDfRV7S0tMjhcMjj8fh9huryezb79Xpfh3eXZge6BQAAvpTV9+9etaYJAAAgUAhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYENDS9+uqrys3NldPplM1m0/PPP+9z3jAMLV68WE6nU2FhYRo/frwOHDjgU+P1ejV//nzFxMQoPDxceXl5On78uE+N2+2Wy+WSw+GQw+GQy+XS6dOnfWqOHTum3NxchYeHKyYmRoWFhWpvb++JYQMAgF4ooKGpra1NV199tcrKyj7z/LJly7RixQqVlZVp7969io+P1+TJk3XmzBmzpqioSJs2bVJ5ebmqq6vV2tqqnJwcdXZ2mjX5+fmqq6tTRUWFKioqVFdXJ5fL
ZZ7v7OxUdna22traVF1drfLycm3cuFHFxcU9N3gAANCr2AzDMALdhCTZbDZt2rRJN954o6RPZpmcTqeKiop09913S/pkVikuLk4PPvigbr/9dnk8Hl122WVat26dZsyYIUl6//33lZiYqC1btigrK0sHDx5USkqKampqlJ6eLkmqqalRRkaG3nrrLSUnJ2vr1q3KyclRQ0ODnE6nJKm8vFyzZs1Sc3OzIiMjLY2hpaVFDodDHo/H8mOsuvyezX693tfh3aXZgW4BAIAvZfX9O2jXNB05ckRNTU3KzMw0j9ntdo0bN047d+6UJNXW1qqjo8Onxul0KjU11azZtWuXHA6HGZgkafTo0XI4HD41qampZmCSpKysLHm9XtXW1n5uj16vVy0tLT4bAADom4I2NDU1NUmS4uLifI7HxcWZ55qamhQaGqqoqKgvrImNje1y/djYWJ+aC58nKipKoaGhZs1nKS0tNddJORwOJSYmXuQoAQBAbxG0oek8m83ms28YRpdjF7qw5rPqu1NzoUWLFsnj8ZhbQ0PDF/YFAAB6r6ANTfHx8ZLUZaanubnZnBWKj49Xe3u73G73F9acOHGiy/VPnjzpU3Ph87jdbnV0dHSZgfo0u92uyMhInw0AAPRNQRuahg4dqvj4eFVVVZnH2tvbtWPHDo0ZM0aSlJaWpv79+/vUNDY2qr6+3qzJyMiQx+PRnj17zJrdu3fL4/H41NTX16uxsdGsqayslN1uV1paWo+OEwAA9A79Avnkra2t+vvf/27uHzlyRHV1dYqOjtbgwYNVVFSkJUuWKCkpSUlJSVqyZIkGDBig/Px8SZLD4dCcOXNUXFysgQMHKjo6WiUlJRoxYoQmTZokSRo+fLimTJmigoICrV69WpJ02223KScnR8nJyZKkzMxMpaSkyOVyafny5Tp16pRKSkpUUFDA7BEAAJAU4NC0b98+TZgwwdxfuHChJGnmzJlau3at7rrrLp09e1Zz586V2+1Wenq6KisrFRERYT5m5cqV6tevn6ZPn66zZ89q4sSJWrt2rUJCQsyaDRs2qLCw0PyWXV5ens+9oUJCQrR582bNnTtXY8eOVVhYmPLz8/XQQw/19EsAAAB6iaC5T1NfwH2afHGfJgBAb9Dr79MEAAAQTAhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFvQLdAPouy6/Z3OgW7ho7y7NDnQLAIAgxUwTAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsCCoQ9PixYtls9l8tvj4ePO8YRhavHixnE6nwsLCNH78eB04cMDnGl6vV/Pnz1dMTIzCw8OVl5en48eP+9S43W65XC45HA45HA65XC6dPn366xgiAADoJYL+Pk1XXXWVXnrpJXM/JCTE/HvZsmVasWKF1q5dq2HDhun+++/X5MmTdejQIUVEREiSioqK9OKLL6q8vFwDBw5UcXGxcnJyVFtba14rPz9fx48fV0VFhSTptttuk8vl0osvvvg1jhTBgHtLAQA+T9CHpn79+vnMLp1nGIZ+85vf6Je//KWmTZsmSXrqqacUFxenZ555Rrfffrs8Ho+efPJJrVu3TpMmTZIkrV+/XomJiXrppZeUlZWlgwcPqqKiQjU1NUpPT5ckPfHEE8rIyNChQ4eUnJz89Q0WAAAEraD+eE6SDh8+LKfTqaFDh+rmm2/WO++8I0k6cuSImpqalJmZadba7XaNGzdOO3fulCTV1taqo6PDp8bpdCo1NdWs2bVrlxwOhxmYJGn06NFyOBxmzefxer1qaWnx2QAAQN8U1KEpPT1dTz/9tP785z/riSee
UFNTk8aMGaMPP/xQTU1NkqS4uDifx8TFxZnnmpqaFBoaqqioqC+siY2N7fLcsbGxZs3nKS0tNddBORwOJSYmdnusAAAguAV1aJo6dap+9KMfacSIEZo0aZI2b/5kvclTTz1l1thsNp/HGIbR5diFLqz5rHor11m0aJE8Ho+5NTQ0fOmYAABA7xTUoelC4eHhGjFihA4fPmyuc7pwNqi5udmcfYqPj1d7e7vcbvcX1pw4caLLc508ebLLLNaF7Ha7IiMjfTYAANA39arQ5PV6dfDgQSUkJGjo0KGKj49XVVWVeb69vV07duzQmDFjJElpaWnq37+/T01jY6Pq6+vNmoyMDHk8Hu3Zs8es2b17tzwej1kDAAAQ1N+eKykpUW5urgYPHqzm5mbdf//9amlp0cyZM2Wz2VRUVKQlS5YoKSlJSUlJWrJkiQYMGKD8/HxJksPh0Jw5c1RcXKyBAwcqOjpaJSUl5sd9kjR8+HBNmTJFBQUFWr16taRPbjmQk5PDN+cAAIApqEPT8ePH9ZOf/EQffPCBLrvsMo0ePVo1NTUaMmSIJOmuu+7S2bNnNXfuXLndbqWnp6uystK8R5MkrVy5Uv369dP06dN19uxZTZw4UWvXrvW539OGDRtUWFhofssuLy9PZWVlX+9gAQBAULMZhmEEuom+oqWlRQ6HQx6Px+/rm3rjTRfx9eDmlgDw1Vh9/+5Va5oAAAAChdAEAABgAaEJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABYQmgAAACwgNAEAAFhAaAIAALCgX6AbAPDVXH7P5kC30C3vLs0OdAsAcFGYaQIAALCA0AQAAGABoQkAAMACQhMAAIAFhCYAAAALCE0AAAAWEJoAAAAsIDQBAABYQGgCAACwgNAEAABgAaEJAADAAn57DkBA9MbfzOP38oBvNmaaAAAALCA0AQAAWEBoAgAAsIDQdIFHH31UQ4cO1aWXXqq0tDT95S9/CXRLAAAgCLAQ/FOeffZZFRUV6dFHH9XYsWO1evVqTZ06VW+++aYGDx4c6PYABBiL14FvNmaaPmXFihWaM2eOfv7zn2v48OH6zW9+o8TERK1atSrQrQEAgABjpun/tLe3q7a2Vvfcc4/P8czMTO3cufMzH+P1euX1es19j8cjSWppafF7f+e8//T7NQH0fYN/8YdAt3DR6u/LCnQL+IY5/75tGMYX1hGa/s8HH3ygzs5OxcXF+RyPi4tTU1PTZz6mtLRU9913X5fjiYmJPdIjAHwTOH4T6A7wTXXmzBk5HI7PPU9ouoDNZvPZNwyjy7HzFi1apIULF5r7586d06lTpzRw4MDPfUx3tLS0KDExUQ0NDYqMjPTbdYMN4+xbvgnj/CaMUWKcfQ3j7MowDJ05c0ZOp/ML6whN/ycmJkYhISFdZpWam5u7zD6dZ7fbZbfbfY59+9vf7qkWFRkZ2af/Bz+PcfYt34RxfhPGKDHOvoZx+vqiGabzWAj+f0JDQ5WWlqaqqiqf41VVVRozZkyAugIAAMGCmaZPWbhwoVwul0aNGqWMjAw9/vjjOnbsmO64445AtwYAAAKM0PQpM2bM0Icffqhf//rXamxsVGpqqrZs2aIhQ4YEtC+73a5f/epXXT4K7GsYZ9/yTRjnN2GMEuPsaxhn99mML/t+HQAAAFjTBAAAYAWhCQAAwAJCEwAAgAWEJgAAAAsITb3Ao48+qqFDh+rSSy9VWlqa/vKXvwS6Jb969dVXlZubK6fTKZvNpueffz7QLfldaWmpfvCDHygiIkKxsbG68cYbdejQoUC35XerVq3SyJEjzZvJZWRkaOvWrYFuq8eVlpbKZrOpqKgo0K341eLFi2Wz2Xy2+Pj4QLfVI9577z3dcsstGjhwoAYMGKB//dd/VW1tbaDb8qvLL7+8y39Pm82mefPmBbo1v/n444/1//7f
/9PQoUMVFham73znO/r1r3+tc+fO+eX6hKYg9+yzz6qoqEi//OUv9frrr+vf/u3fNHXqVB07dizQrflNW1ubrr76apWVlQW6lR6zY8cOzZs3TzU1NaqqqtLHH3+szMxMtbW1Bbo1vxo0aJCWLl2qffv2ad++fbruuut0ww036MCBA4Furcfs3btXjz/+uEaOHBnoVnrEVVddpcbGRnPbv39/oFvyO7fbrbFjx6p///7aunWr3nzzTT388MM9+gsPgbB3716f/5bnb+b84x//OMCd+c+DDz6oxx57TGVlZTp48KCWLVum5cuX67//+7/98wQGgto111xj3HHHHT7HrrzySuOee+4JUEc9S5KxadOmQLfR45qbmw1Jxo4dOwLdSo+Liooyfvvb3wa6jR5x5swZIykpyaiqqjLGjRtnLFiwINAt+dWvfvUr4+qrrw50Gz3u7rvvNq699tpAt/G1W7BggfHd737XOHfuXKBb8Zvs7Gxj9uzZPsemTZtm3HLLLX65PjNNQay9vV21tbXKzMz0OZ6ZmamdO3cGqCv4g8fjkSRFR0cHuJOe09nZqfLycrW1tSkjIyPQ7fSIefPmKTs7W5MmTQp0Kz3m8OHDcjqdGjp0qG6++Wa98847gW7J71544QWNGjVKP/7xjxUbG6vvfe97euKJJwLdVo9qb2/X+vXrNXv2bL/+wHygXXvttXr55Zf19ttvS5L+9re/qbq6Wtdff71frs8dwYPYBx98oM7Ozi4/GBwXF9flh4XRexiGoYULF+raa69VampqoNvxu/379ysjI0MfffSRvvWtb2nTpk1KSUkJdFt+V15ertdee0179+4NdCs9Jj09XU8//bSGDRumEydO6P7779eYMWN04MABDRw4MNDt+c0777yjVatWaeHChbr33nu1Z88eFRYWym6369Zbbw10ez3i+eef1+nTpzVr1qxAt+JXd999tzwej6688kqFhISos7NTDzzwgH7yk5/45fqEpl7gwn8FGIbRp/5l8E1z55136o033lB1dXWgW+kRycnJqqur0+nTp7Vx40bNnDlTO3bs6FPBqaGhQQsWLFBlZaUuvfTSQLfTY6ZOnWr+PWLECGVkZOi73/2unnrqKS1cuDCAnfnXuXPnNGrUKC1ZskSS9L3vfU8HDhzQqlWr+mxoevLJJzV16lQ5nc5At+JXzz77rNavX69nnnlGV111lerq6lRUVCSn06mZM2d+5esTmoJYTEyMQkJCuswqNTc3d5l9Qu8wf/58vfDCC3r11Vc1aNCgQLfTI0JDQ3XFFVdIkkaNGqW9e/fqkUce0erVqwPcmf/U1taqublZaWlp5rHOzk69+uqrKisrk9frVUhISAA77Bnh4eEaMWKEDh8+HOhW/CohIaFLqB8+fLg2btwYoI561tGjR/XSSy/pueeeC3Qrfvcf//Efuueee3TzzTdL+iTsHz16VKWlpX4JTaxpCmKhoaFKS0szv+FwXlVVlcaMGROgrtAdhmHozjvv1HPPPadt27Zp6NChgW7pa2MYhrxeb6Db8KuJEydq//79qqurM7dRo0bppz/9qerq6vpkYJIkr9ergwcPKiEhIdCt+NXYsWO73ALk7bffDviPtfeUNWvWKDY2VtnZ2YFuxe/++c9/6pJLfKNNSEiI3245wExTkFu4cKFcLpdGjRqljIwMPf744zp27JjuuOOOQLfmN62trfr73/9u7h85ckR1dXWKjo7W4MGDA9iZ/8ybN0/PPPOM/vSnPykiIsKcPXQ4HAoLCwtwd/5z7733aurUqUpMTNSZM2dUXl6u7du3q6KiItCt+VVERESX9Wjh4eEaOHBgn1qnVlJSotzcXA0ePFjNzc26//771dLS4pd/sQeTX/ziFxozZoyWLFmi6dOna8+ePXr88cf1+OOPB7o1vzt37pzWrFmjmTNnql+/vhcBcnNz9cADD2jw4MG66qqr9Prrr2vFihWaPXu2f57AL9/BQ4/6n//5H2PI
kCFGaGio8f3vf7/PfU39lVdeMSR12WbOnBno1vzms8YnyVizZk2gW/Or2bNnm/+vXnbZZcbEiRONysrKQLf1teiLtxyYMWOGkZCQYPTv399wOp3GtGnTjAMHDgS6rR7x4osvGqmpqYbdbjeuvPJK4/HHHw90Sz3iz3/+syHJOHToUKBb6REtLS3GggULjMGDBxuXXnqp8Z3vfMf45S9/aXi9Xr9c32YYhuGf+AUAANB3saYJAADAAkITAACABYQmAAAACwhNAAAAFhCaAAAALCA0AQAAWEBoAgAAsIDQBAAAYAGhCQAAwAJCEwAAgAWEJgAAAAsITQAAABb8f7BEN+bQvXbhAAAAAElFTkSuQmCC",
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "df[\"log_votes\"].plot.hist()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check the number of missing values for each columm below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "reviewText    9\n",
      "summary       7\n",
      "verified      0\n",
      "time          0\n",
      "rating        0\n",
      "log_votes     0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.isna().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. <a name=\"3\">Text Processing: Stop words removal and stemming</a>\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/studio-lab-\n",
      "[nltk_data]     user/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package stopwords to /home/studio-lab-\n",
      "[nltk_data]     user/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Install the library and functions\n",
    "import nltk\n",
    "\n",
    "nltk.download('punkt')\n",
    "nltk.download('stopwords')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will create the stop word removal and text cleaning processes below. NLTK library provides a list of common stop words. We will use the list, but remove some of the words from that list (because those words are actually useful to understand the sentiment in the sentence)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import nltk, re\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import SnowballStemmer\n",
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "# Let's get a list of stop words from the NLTK library\n",
    "stop = stopwords.words('english')\n",
    "\n",
    "# These words are important for our problem. We don't want to remove them.\n",
    "excluding = ['against', 'not', 'don', \"don't\",'ain', 'aren', \"aren't\", 'couldn', \"couldn't\",\n",
    "             'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", \n",
    "             'haven', \"haven't\", 'isn', \"isn't\", 'mightn', \"mightn't\", 'mustn', \"mustn't\",\n",
    "             'needn', \"needn't\",'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \n",
    "             \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n",
    "\n",
    "# New stop word list\n",
    "stop_words = [word for word in stop if word not in excluding]\n",
    "\n",
    "snow = SnowballStemmer('english')\n",
    "\n",
    "def process_text(texts): \n",
    "    final_text_list=[]\n",
    "    for sent in texts:\n",
    "        \n",
    "        # Check if the sentence is a missing value\n",
    "        if isinstance(sent, str) == False:\n",
    "            sent = \"\"\n",
    "            \n",
    "        filtered_sentence=[]\n",
    "        \n",
    "        sent = sent.lower() # Lowercase \n",
    "        sent = sent.strip() # Remove leading/trailing whitespace\n",
    "        sent = re.sub('\\s+', ' ', sent) # Remove extra space and tabs\n",
    "        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markups:\n",
    "        \n",
    "        for w in word_tokenize(sent):\n",
    "            # We are applying some custom filtering here, feel free to try different things\n",
    "            # Check if it is not numeric and its length>2 and not in stop words\n",
    "            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  \n",
    "                # Stem and add to filtered list\n",
    "                filtered_sentence.append(snow.stem(w))\n",
    "        final_string = \" \".join(filtered_sentence) #final string of cleaned words\n",
    " \n",
    "        final_text_list.append(final_string)\n",
    "        \n",
    "    return final_text_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. <a name=\"4\">Train - Validation Split</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Let's split our dataset into training (90%) and validation (10%). We will use \"reviewText\", \"summary\", \"time\", \"rating\" fields and predict the \"log_votes\" field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "\n",
    "X_train, X_val, y_train, y_val = train_test_split(df[[\"reviewText\", \"summary\", \"time\", \"rating\"]],\n",
    "                                                  df[\"log_votes\"],\n",
    "                                                  test_size=0.10,\n",
    "                                                  shuffle=True,\n",
    "                                                  random_state=324\n",
    "                                                 )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing the reviewText fields\n",
      "Processing the summary fields\n"
     ]
    }
   ],
   "source": [
    "print(\"Processing the reviewText fields\")\n",
    "X_train[\"reviewText\"] = process_text(X_train[\"reviewText\"].tolist())\n",
    "X_val[\"reviewText\"] = process_text(X_val[\"reviewText\"].tolist())\n",
    "\n",
    "print(\"Processing the summary fields\")\n",
    "X_train[\"summary\"] = process_text(X_train[\"summary\"].tolist())\n",
    "X_val[\"summary\"] = process_text(X_val[\"summary\"].tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our __process_text()__ method in section 3 uses empty string for missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. <a name=\"5\">Data processing with Pipeline and ColumnTransform</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "In the previous examples, we have seen how to use pipeline to prepare a data field for our machine learning model. This time, we will focus on multiple fields: numeric and text fields. We are using linear regression model from Sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model. \n",
    "\n",
    "   * For the numerical features pipeline, the __numerical_processor__ below, we use a MinMaxScaler (don't have to scale features when using Decision Trees, but it's a good idea to see how to use more data transforms). If different processing is desired for different numerical features, different pipelines should be built - just like shown below for the two text features.\n",
    "   * For the numerical features pipeline, the __text_processor__ below, we use CountVectorizer() for the text fields.\n",
    "   \n",
    "The selective preparations of the dataset features are then put together into a collective ColumnTransformer, to be finally used in a Pipeline along with an estimator. This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Grab model features/inputs and target/output\n",
    "numerical_features = ['time',\n",
    "                      'rating']\n",
    "\n",
    "text_features = ['summary',\n",
    "                 'reviewText']\n",
    "\n",
    "model_features = numerical_features + text_features\n",
    "model_target = 'log_votes'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {\n",
       "  /* Definition of color scheme common for light and dark mode */\n",
       "  --sklearn-color-text: black;\n",
       "  --sklearn-color-line: gray;\n",
       "  /* Definition of color scheme for unfitted estimators */\n",
       "  --sklearn-color-unfitted-level-0: #fff5e6;\n",
       "  --sklearn-color-unfitted-level-1: #f6e4d2;\n",
       "  --sklearn-color-unfitted-level-2: #ffe0b3;\n",
       "  --sklearn-color-unfitted-level-3: chocolate;\n",
       "  /* Definition of color scheme for fitted estimators */\n",
       "  --sklearn-color-fitted-level-0: #f0f8ff;\n",
       "  --sklearn-color-fitted-level-1: #d4ebff;\n",
       "  --sklearn-color-fitted-level-2: #b3dbfd;\n",
       "  --sklearn-color-fitted-level-3: cornflowerblue;\n",
       "\n",
       "  /* Specific color for light theme */\n",
       "  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
       "  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-icon: #696969;\n",
       "\n",
       "  @media (prefers-color-scheme: dark) {\n",
       "    /* Redefinition of color scheme for dark theme */\n",
       "    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
       "    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-icon: #878787;\n",
       "  }\n",
       "}\n",
       "\n",
       "#sk-container-id-1 {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 pre {\n",
       "  padding: 0;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-hidden--visually {\n",
       "  border: 0;\n",
       "  clip: rect(1px 1px 1px 1px);\n",
       "  clip: rect(1px, 1px, 1px, 1px);\n",
       "  height: 1px;\n",
       "  margin: -1px;\n",
       "  overflow: hidden;\n",
       "  padding: 0;\n",
       "  position: absolute;\n",
       "  width: 1px;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-dashed-wrapped {\n",
       "  border: 1px dashed var(--sklearn-color-line);\n",
       "  margin: 0 0.4em 0.5em 0.4em;\n",
       "  box-sizing: border-box;\n",
       "  padding-bottom: 0.4em;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-container {\n",
       "  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
       "     but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
       "     so we also need the `!important` here to be able to override the\n",
       "     default hidden behavior on the sphinx rendered scikit-learn.org.\n",
       "     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
       "  display: inline-block !important;\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-text-repr-fallback {\n",
       "  display: none;\n",
       "}\n",
       "\n",
       "div.sk-parallel-item,\n",
       "div.sk-serial,\n",
       "div.sk-item {\n",
       "  /* draw centered vertical line to link estimators */\n",
       "  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
       "  background-size: 2px 100%;\n",
       "  background-repeat: no-repeat;\n",
       "  background-position: center center;\n",
       "}\n",
       "\n",
       "/* Parallel-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item::after {\n",
       "  content: \"\";\n",
       "  width: 100%;\n",
       "  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
       "  flex-grow: 1;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel {\n",
       "  display: flex;\n",
       "  align-items: stretch;\n",
       "  justify-content: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:first-child::after {\n",
       "  align-self: flex-end;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:last-child::after {\n",
       "  align-self: flex-start;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-parallel-item:only-child::after {\n",
       "  width: 0;\n",
       "}\n",
       "\n",
       "/* Serial-specific style estimator block */\n",
       "\n",
       "#sk-container-id-1 div.sk-serial {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "  align-items: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  padding-right: 1em;\n",
       "  padding-left: 1em;\n",
       "}\n",
       "\n",
       "\n",
       "/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
       "clickable and can be expanded/collapsed.\n",
       "- Pipeline and ColumnTransformer use this feature and define the default style\n",
       "- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
       "*/\n",
       "\n",
       "/* Pipeline and ColumnTransformer style (default) */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable {\n",
       "  /* Default theme specific background. It is overwritten whether we have a\n",
       "  specific estimator or a Pipeline/ColumnTransformer */\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "/* Toggleable label */\n",
       "#sk-container-id-1 label.sk-toggleable__label {\n",
       "  cursor: pointer;\n",
       "  display: block;\n",
       "  width: 100%;\n",
       "  margin-bottom: 0;\n",
       "  padding: 0.5em;\n",
       "  box-sizing: border-box;\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:before {\n",
       "  /* Arrow on the left of the label */\n",
       "  content: \"▸\";\n",
       "  float: left;\n",
       "  margin-right: 0.25em;\n",
       "  color: var(--sklearn-color-icon);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "/* Toggleable content - dropdown */\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content {\n",
       "  max-height: 0;\n",
       "  max-width: 0;\n",
       "  overflow: hidden;\n",
       "  text-align: left;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content pre {\n",
       "  margin: 0.2em;\n",
       "  border-radius: 0.25em;\n",
       "  color: var(--sklearn-color-text);\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-toggleable__content.fitted pre {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
       "  /* Expand drop-down */\n",
       "  max-height: 200px;\n",
       "  max-width: 100%;\n",
       "  overflow: auto;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
       "  content: \"▾\";\n",
       "}\n",
       "\n",
       "/* Pipeline/ColumnTransformer-specific style */\n",
       "\n",
       "#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator-specific style */\n",
       "\n",
       "/* Colorize estimator box */\n",
       "#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label label.sk-toggleable__label,\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  /* The background is the default theme color */\n",
       "  color: var(--sklearn-color-text-on-default-background);\n",
       "}\n",
       "\n",
       "/* On hover, darken the color of the background */\n",
       "#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "/* Label box, darken color on hover, fitted */\n",
       "#sk-container-id-1 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator label */\n",
       "\n",
       "#sk-container-id-1 div.sk-label label {\n",
       "  font-family: monospace;\n",
       "  font-weight: bold;\n",
       "  display: inline-block;\n",
       "  line-height: 1.2em;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-label-container {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "/* Estimator-specific */\n",
       "#sk-container-id-1 div.sk-estimator {\n",
       "  font-family: monospace;\n",
       "  border: 1px dotted var(--sklearn-color-border-box);\n",
       "  border-radius: 0.25em;\n",
       "  box-sizing: border-box;\n",
       "  margin-bottom: 0.5em;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "/* on hover */\n",
       "#sk-container-id-1 div.sk-estimator:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-1 div.sk-estimator.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
       "\n",
       "/* Common style for \"i\" and \"?\" */\n",
       "\n",
       ".sk-estimator-doc-link,\n",
       "a:link.sk-estimator-doc-link,\n",
       "a:visited.sk-estimator-doc-link {\n",
       "  float: right;\n",
       "  font-size: smaller;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1em;\n",
       "  height: 1em;\n",
       "  width: 1em;\n",
       "  text-decoration: none !important;\n",
       "  margin-left: 1ex;\n",
       "  /* unfitted */\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted,\n",
       "a:link.sk-estimator-doc-link.fitted,\n",
       "a:visited.sk-estimator-doc-link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "/* Span, style for the box shown on hovering the info icon */\n",
       ".sk-estimator-doc-link span {\n",
       "  display: none;\n",
       "  z-index: 9999;\n",
       "  position: relative;\n",
       "  font-weight: normal;\n",
       "  right: .2ex;\n",
       "  padding: .5ex;\n",
       "  margin: .5ex;\n",
       "  width: min-content;\n",
       "  min-width: 20ex;\n",
       "  max-width: 50ex;\n",
       "  color: var(--sklearn-color-text);\n",
       "  box-shadow: 2pt 2pt 4pt #999;\n",
       "  /* unfitted */\n",
       "  background: var(--sklearn-color-unfitted-level-0);\n",
       "  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted span {\n",
       "  /* fitted */\n",
       "  background: var(--sklearn-color-fitted-level-0);\n",
       "  border: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link:hover span {\n",
       "  display: block;\n",
       "}\n",
       "\n",
       "/* \"?\"-specific style due to the `<a>` HTML tag */\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link {\n",
       "  float: right;\n",
       "  font-size: 1rem;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1rem;\n",
       "  height: 1rem;\n",
       "  width: 1rem;\n",
       "  text-decoration: none;\n",
       "  /* unfitted */\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "#sk-container-id-1 a.estimator_doc_link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "#sk-container-id-1 a.estimator_doc_link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;rating&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;lr&#x27;, LinearRegression())])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;&nbsp;Pipeline<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.pipeline.Pipeline.html\">?<span>Documentation for Pipeline</span></a><span class=\"sk-estimator-doc-link \">i<span>Not fitted</span></span></label><div class=\"sk-toggleable__content \"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;rating&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;lr&#x27;, LinearRegression())])</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;data_preprocessing: ColumnTransformer<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.compose.ColumnTransformer.html\">?<span>Documentation for data_preprocessing: ColumnTransformer</span></a></label><div class=\"sk-toggleable__content \"><pre>ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                  MinMaxScaler())]),\n",
       "                                 [&#x27;time&#x27;, &#x27;rating&#x27;]),\n",
       "                                (&#x27;text_pre_0&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=50))]),\n",
       "                                 &#x27;summary&#x27;),\n",
       "                                (&#x27;text_pre_1&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=150))]),\n",
       "                                 &#x27;reviewText&#x27;)])</pre></div> </div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">numerical_pre</label><div class=\"sk-toggleable__content \"><pre>[&#x27;time&#x27;, &#x27;rating&#x27;]</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;MinMaxScaler<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.preprocessing.MinMaxScaler.html\">?<span>Documentation for MinMaxScaler</span></a></label><div class=\"sk-toggleable__content \"><pre>MinMaxScaler()</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-5\" type=\"checkbox\" ><label for=\"sk-estimator-id-5\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">text_pre_0</label><div class=\"sk-toggleable__content \"><pre>summary</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-6\" type=\"checkbox\" ><label for=\"sk-estimator-id-6\" class=\"sk-toggleable__label  
sk-toggleable__label-arrow \">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content \"><pre>CountVectorizer(binary=True, max_features=50)</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-7\" type=\"checkbox\" ><label for=\"sk-estimator-id-7\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">text_pre_1</label><div class=\"sk-toggleable__content \"><pre>reviewText</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-8\" type=\"checkbox\" ><label for=\"sk-estimator-id-8\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content \"><pre>CountVectorizer(binary=True, max_features=150)</pre></div> </div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator  sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-9\" type=\"checkbox\" ><label for=\"sk-estimator-id-9\" class=\"sk-toggleable__label  sk-toggleable__label-arrow \">&nbsp;LinearRegression<a class=\"sk-estimator-doc-link \" rel=\"noreferrer\" target=\"_blank\" 
href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.linear_model.LinearRegression.html\">?<span>Documentation for LinearRegression</span></a></label><div class=\"sk-toggleable__content \"><pre>LinearRegression()</pre></div> </div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('data_preprocessing',\n",
       "                 ColumnTransformer(transformers=[('numerical_pre',\n",
       "                                                  Pipeline(steps=[('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['time', 'rating']),\n",
       "                                                 ('text_pre_0',\n",
       "                                                  Pipeline(steps=[('text_vect_0',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  'summary'),\n",
       "                                                 ('text_pre_1',\n",
       "                                                  Pipeline(steps=[('text_vect_1',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  'reviewText')])),\n",
       "                ('lr', LinearRegression())])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "### COLUMN_TRANSFORMER ###\n",
    "##########################\n",
    "\n",
    "# Preprocess the numerical features\n",
    "numerical_processor = Pipeline([\n",
    "    ('num_scaler', MinMaxScaler())\n",
    "])\n",
    "# Preprocess 1st text feature\n",
    "text_processor_0 = Pipeline([\n",
    "    ('text_vect_0', CountVectorizer(binary=True, max_features=50))\n",
    "])\n",
    "\n",
    "# Preprocess 2nd text feature (larger vocabulary)\n",
    "text_processor_1 = Pipeline([\n",
    "    ('text_vect_1', CountVectorizer(binary=True, max_features=150))\n",
    "])\n",
    "\n",
    "# Combine all data preprocessors from above (add more, if you choose to define more!)\n",
    "# For each processor/step specify: a name, the actual process, and finally the features to be processed\n",
    "data_preprocessor = ColumnTransformer([\n",
    "    ('numerical_pre', numerical_processor, numerical_features),\n",
    "    ('text_pre_0', text_processor_0, text_features[0]),\n",
    "    ('text_pre_1', text_processor_1, text_features[1])\n",
    "]) \n",
    "\n",
    "### PIPELINE ###\n",
    "################\n",
    "\n",
    "# Chain all desired data transformers into a Pipeline, with an estimator at the end\n",
    "# Later you can access or set parameters via the step names assigned here - for hyperparameter tuning, for example\n",
    "pipeline = Pipeline([\n",
    "    ('data_preprocessing', data_preprocessor),\n",
    "    ('lr', LinearRegression())\n",
    "])\n",
    "\n",
    "# Visualize the pipeline\n",
    "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n",
    "from sklearn import set_config\n",
    "set_config(display='diagram')\n",
    "pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. <a name=\"6\">Train the regressor</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We train our model by calling __.fit()__ on the pipeline with our training dataset. This fits the preprocessing steps and the regressor in one go."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {\n",
       "  /* Definition of color scheme common for light and dark mode */\n",
       "  --sklearn-color-text: black;\n",
       "  --sklearn-color-line: gray;\n",
       "  /* Definition of color scheme for unfitted estimators */\n",
       "  --sklearn-color-unfitted-level-0: #fff5e6;\n",
       "  --sklearn-color-unfitted-level-1: #f6e4d2;\n",
       "  --sklearn-color-unfitted-level-2: #ffe0b3;\n",
       "  --sklearn-color-unfitted-level-3: chocolate;\n",
       "  /* Definition of color scheme for fitted estimators */\n",
       "  --sklearn-color-fitted-level-0: #f0f8ff;\n",
       "  --sklearn-color-fitted-level-1: #d4ebff;\n",
       "  --sklearn-color-fitted-level-2: #b3dbfd;\n",
       "  --sklearn-color-fitted-level-3: cornflowerblue;\n",
       "\n",
       "  /* Specific color for light theme */\n",
       "  --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
       "  --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
       "  --sklearn-color-icon: #696969;\n",
       "\n",
       "  @media (prefers-color-scheme: dark) {\n",
       "    /* Redefinition of color scheme for dark theme */\n",
       "    --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
       "    --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
       "    --sklearn-color-icon: #878787;\n",
       "  }\n",
       "}\n",
       "\n",
       "#sk-container-id-2 {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 pre {\n",
       "  padding: 0;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 input.sk-hidden--visually {\n",
       "  border: 0;\n",
       "  clip: rect(1px 1px 1px 1px);\n",
       "  clip: rect(1px, 1px, 1px, 1px);\n",
       "  height: 1px;\n",
       "  margin: -1px;\n",
       "  overflow: hidden;\n",
       "  padding: 0;\n",
       "  position: absolute;\n",
       "  width: 1px;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-dashed-wrapped {\n",
       "  border: 1px dashed var(--sklearn-color-line);\n",
       "  margin: 0 0.4em 0.5em 0.4em;\n",
       "  box-sizing: border-box;\n",
       "  padding-bottom: 0.4em;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-container {\n",
       "  /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
       "     but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
       "     so we also need the `!important` here to be able to override the\n",
       "     default hidden behavior on the sphinx rendered scikit-learn.org.\n",
       "     See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
       "  display: inline-block !important;\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-text-repr-fallback {\n",
       "  display: none;\n",
       "}\n",
       "\n",
       "div.sk-parallel-item,\n",
       "div.sk-serial,\n",
       "div.sk-item {\n",
       "  /* draw centered vertical line to link estimators */\n",
       "  background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
       "  background-size: 2px 100%;\n",
       "  background-repeat: no-repeat;\n",
       "  background-position: center center;\n",
       "}\n",
       "\n",
       "/* Parallel-specific style estimator block */\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item::after {\n",
       "  content: \"\";\n",
       "  width: 100%;\n",
       "  border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
       "  flex-grow: 1;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel {\n",
       "  display: flex;\n",
       "  align-items: stretch;\n",
       "  justify-content: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  position: relative;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item:first-child::after {\n",
       "  align-self: flex-end;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item:last-child::after {\n",
       "  align-self: flex-start;\n",
       "  width: 50%;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-parallel-item:only-child::after {\n",
       "  width: 0;\n",
       "}\n",
       "\n",
       "/* Serial-specific style estimator block */\n",
       "\n",
       "#sk-container-id-2 div.sk-serial {\n",
       "  display: flex;\n",
       "  flex-direction: column;\n",
       "  align-items: center;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  padding-right: 1em;\n",
       "  padding-left: 1em;\n",
       "}\n",
       "\n",
       "\n",
       "/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
       "clickable and can be expanded/collapsed.\n",
       "- Pipeline and ColumnTransformer use this feature and define the default style\n",
       "- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
       "*/\n",
       "\n",
       "/* Pipeline and ColumnTransformer style (default) */\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable {\n",
       "  /* Default theme specific background. It is overwritten whether we have a\n",
       "  specific estimator or a Pipeline/ColumnTransformer */\n",
       "  background-color: var(--sklearn-color-background);\n",
       "}\n",
       "\n",
       "/* Toggleable label */\n",
       "#sk-container-id-2 label.sk-toggleable__label {\n",
       "  cursor: pointer;\n",
       "  display: block;\n",
       "  width: 100%;\n",
       "  margin-bottom: 0;\n",
       "  padding: 0.5em;\n",
       "  box-sizing: border-box;\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 label.sk-toggleable__label-arrow:before {\n",
       "  /* Arrow on the left of the label */\n",
       "  content: \"▸\";\n",
       "  float: left;\n",
       "  margin-right: 0.25em;\n",
       "  color: var(--sklearn-color-icon);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {\n",
       "  color: var(--sklearn-color-text);\n",
       "}\n",
       "\n",
       "/* Toggleable content - dropdown */\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content {\n",
       "  max-height: 0;\n",
       "  max-width: 0;\n",
       "  overflow: hidden;\n",
       "  text-align: left;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content pre {\n",
       "  margin: 0.2em;\n",
       "  border-radius: 0.25em;\n",
       "  color: var(--sklearn-color-text);\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-toggleable__content.fitted pre {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
       "  /* Expand drop-down */\n",
       "  max-height: 200px;\n",
       "  max-width: 100%;\n",
       "  overflow: auto;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
       "  content: \"▾\";\n",
       "}\n",
       "\n",
       "/* Pipeline/ColumnTransformer-specific style */\n",
       "\n",
       "#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator-specific style */\n",
       "\n",
       "/* Colorize estimator box */\n",
       "#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-label label.sk-toggleable__label,\n",
       "#sk-container-id-2 div.sk-label label {\n",
       "  /* The background is the default theme color */\n",
       "  color: var(--sklearn-color-text-on-default-background);\n",
       "}\n",
       "\n",
       "/* On hover, darken the color of the background */\n",
       "#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "/* Label box, darken color on hover, fitted */\n",
       "#sk-container-id-2 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
       "  color: var(--sklearn-color-text);\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Estimator label */\n",
       "\n",
       "#sk-container-id-2 div.sk-label label {\n",
       "  font-family: monospace;\n",
       "  font-weight: bold;\n",
       "  display: inline-block;\n",
       "  line-height: 1.2em;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-label-container {\n",
       "  text-align: center;\n",
       "}\n",
       "\n",
       "/* Estimator-specific */\n",
       "#sk-container-id-2 div.sk-estimator {\n",
       "  font-family: monospace;\n",
       "  border: 1px dotted var(--sklearn-color-border-box);\n",
       "  border-radius: 0.25em;\n",
       "  box-sizing: border-box;\n",
       "  margin-bottom: 0.5em;\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-0);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-estimator.fitted {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-0);\n",
       "}\n",
       "\n",
       "/* on hover */\n",
       "#sk-container-id-2 div.sk-estimator:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-2);\n",
       "}\n",
       "\n",
       "#sk-container-id-2 div.sk-estimator.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-2);\n",
       "}\n",
       "\n",
       "/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
       "\n",
       "/* Common style for \"i\" and \"?\" */\n",
       "\n",
       ".sk-estimator-doc-link,\n",
       "a:link.sk-estimator-doc-link,\n",
       "a:visited.sk-estimator-doc-link {\n",
       "  float: right;\n",
       "  font-size: smaller;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1em;\n",
       "  height: 1em;\n",
       "  width: 1em;\n",
       "  text-decoration: none !important;\n",
       "  margin-left: 1ex;\n",
       "  /* unfitted */\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted,\n",
       "a:link.sk-estimator-doc-link.fitted,\n",
       "a:visited.sk-estimator-doc-link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
       ".sk-estimator-doc-link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover,\n",
       "div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
       ".sk-estimator-doc-link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "/* Span, style for the box shown on hovering the info icon */\n",
       ".sk-estimator-doc-link span {\n",
       "  display: none;\n",
       "  z-index: 9999;\n",
       "  position: relative;\n",
       "  font-weight: normal;\n",
       "  right: .2ex;\n",
       "  padding: .5ex;\n",
       "  margin: .5ex;\n",
       "  width: min-content;\n",
       "  min-width: 20ex;\n",
       "  max-width: 50ex;\n",
       "  color: var(--sklearn-color-text);\n",
       "  box-shadow: 2pt 2pt 4pt #999;\n",
       "  /* unfitted */\n",
       "  background: var(--sklearn-color-unfitted-level-0);\n",
       "  border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link.fitted span {\n",
       "  /* fitted */\n",
       "  background: var(--sklearn-color-fitted-level-0);\n",
       "  border: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "\n",
       ".sk-estimator-doc-link:hover span {\n",
       "  display: block;\n",
       "}\n",
       "\n",
       "/* \"?\"-specific style due to the `<a>` HTML tag */\n",
       "\n",
       "#sk-container-id-2 a.estimator_doc_link {\n",
       "  float: right;\n",
       "  font-size: 1rem;\n",
       "  line-height: 1em;\n",
       "  font-family: monospace;\n",
       "  background-color: var(--sklearn-color-background);\n",
       "  border-radius: 1rem;\n",
       "  height: 1rem;\n",
       "  width: 1rem;\n",
       "  text-decoration: none;\n",
       "  /* unfitted */\n",
       "  color: var(--sklearn-color-unfitted-level-1);\n",
       "  border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 a.estimator_doc_link.fitted {\n",
       "  /* fitted */\n",
       "  border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
       "  color: var(--sklearn-color-fitted-level-1);\n",
       "}\n",
       "\n",
       "/* On hover */\n",
       "#sk-container-id-2 a.estimator_doc_link:hover {\n",
       "  /* unfitted */\n",
       "  background-color: var(--sklearn-color-unfitted-level-3);\n",
       "  color: var(--sklearn-color-background);\n",
       "  text-decoration: none;\n",
       "}\n",
       "\n",
       "#sk-container-id-2 a.estimator_doc_link.fitted:hover {\n",
       "  /* fitted */\n",
       "  background-color: var(--sklearn-color-fitted-level-3);\n",
       "}\n",
       "</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;rating&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;lr&#x27;, LinearRegression())])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-10\" type=\"checkbox\" ><label for=\"sk-estimator-id-10\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;&nbsp;Pipeline<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.pipeline.Pipeline.html\">?<span>Documentation for Pipeline</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>Pipeline(steps=[(&#x27;data_preprocessing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;time&#x27;, &#x27;rating&#x27;]),\n",
       "                                                 (&#x27;text_pre_0&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  &#x27;summary&#x27;),\n",
       "                                                 (&#x27;text_pre_1&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  &#x27;reviewText&#x27;)])),\n",
       "                (&#x27;lr&#x27;, LinearRegression())])</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-11\" type=\"checkbox\" ><label for=\"sk-estimator-id-11\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;data_preprocessing: ColumnTransformer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.compose.ColumnTransformer.html\">?<span>Documentation for data_preprocessing: ColumnTransformer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>ColumnTransformer(transformers=[(&#x27;numerical_pre&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;num_scaler&#x27;,\n",
       "                                                  MinMaxScaler())]),\n",
       "                                 [&#x27;time&#x27;, &#x27;rating&#x27;]),\n",
       "                                (&#x27;text_pre_0&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_0&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=50))]),\n",
       "                                 &#x27;summary&#x27;),\n",
       "                                (&#x27;text_pre_1&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;text_vect_1&#x27;,\n",
       "                                                  CountVectorizer(binary=True,\n",
       "                                                                  max_features=150))]),\n",
       "                                 &#x27;reviewText&#x27;)])</pre></div> </div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-12\" type=\"checkbox\" ><label for=\"sk-estimator-id-12\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">numerical_pre</label><div class=\"sk-toggleable__content fitted\"><pre>[&#x27;time&#x27;, &#x27;rating&#x27;]</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-13\" type=\"checkbox\" ><label for=\"sk-estimator-id-13\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;MinMaxScaler<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.preprocessing.MinMaxScaler.html\">?<span>Documentation for MinMaxScaler</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>MinMaxScaler()</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-14\" type=\"checkbox\" ><label for=\"sk-estimator-id-14\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">text_pre_0</label><div class=\"sk-toggleable__content fitted\"><pre>summary</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-15\" 
type=\"checkbox\" ><label for=\"sk-estimator-id-15\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>CountVectorizer(binary=True, max_features=50)</pre></div> </div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-16\" type=\"checkbox\" ><label for=\"sk-estimator-id-16\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">text_pre_1</label><div class=\"sk-toggleable__content fitted\"><pre>reviewText</pre></div> </div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-17\" type=\"checkbox\" ><label for=\"sk-estimator-id-17\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;CountVectorizer<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\">?<span>Documentation for CountVectorizer</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>CountVectorizer(binary=True, max_features=150)</pre></div> </div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-18\" type=\"checkbox\" ><label for=\"sk-estimator-id-18\" 
class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\">&nbsp;LinearRegression<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.4/modules/generated/sklearn.linear_model.LinearRegression.html\">?<span>Documentation for LinearRegression</span></a></label><div class=\"sk-toggleable__content fitted\"><pre>LinearRegression()</pre></div> </div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('data_preprocessing',\n",
       "                 ColumnTransformer(transformers=[('numerical_pre',\n",
       "                                                  Pipeline(steps=[('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['time', 'rating']),\n",
       "                                                 ('text_pre_0',\n",
       "                                                  Pipeline(steps=[('text_vect_0',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=50))]),\n",
       "                                                  'summary'),\n",
       "                                                 ('text_pre_1',\n",
       "                                                  Pipeline(steps=[('text_vect_1',\n",
       "                                                                   CountVectorizer(binary=True,\n",
       "                                                                                   max_features=150))]),\n",
       "                                                  'reviewText')])),\n",
       "                ('lr', LinearRegression())])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fit the Pipeline to training data\n",
    "pipeline.fit(X_train[model_features], y_train.values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. <a name=\"7\">Fitting Linear Regression models and checking the validation performance</a>\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.1  LinearRegression\n",
    "The pipeline fitted above already uses __LinearRegression__ from the scikit-learn library, so let's check its performance on the validation dataset. Using the __coef___ attribute, we can also print the learned weights of the model.\n",
    "\n",
    "Find more details on __LinearRegression__ here:\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LinearRegression on Validation: Mean_squared_error: 0.591002,  R_square_score: 0.356090\n",
      "LinearRegression model weights: \n",
      " [-1.74577310e+00 -4.16698110e-01  6.55550993e-02 -2.35263334e-02\n",
      "  8.61218902e-02 -1.01349742e-02  7.93538441e-02  1.00580439e-01\n",
      "  5.29632578e-02  2.05178698e-02  5.73210227e-02  2.22065363e-01\n",
      "  9.93748268e-02 -1.37704322e-02 -4.06300146e-02 -1.02485740e-04\n",
      "  4.40500748e-02  9.05559899e-02  2.17964474e-02  1.26253631e-02\n",
      " -2.25983382e-02  8.51511565e-03  2.46082793e-02 -2.33539043e-02\n",
      " -3.30885517e-02 -9.93542018e-03 -1.51318556e-01 -3.76403328e-02\n",
      "  6.49346208e-02 -1.17076202e-02  1.08907076e-02  1.58695833e-02\n",
      " -4.96627026e-02 -5.60340140e-02  5.79982972e-02 -8.74184116e-02\n",
      "  1.68562800e-02 -4.05454093e-02  2.67896146e-02  4.66004001e-02\n",
      " -9.30084557e-03  1.16290987e-01  2.62059795e-02 -2.14182503e-02\n",
      " -1.19426267e-02 -4.02101489e-02 -5.45595262e-02 -1.20283335e-01\n",
      "  1.28811837e-02  5.21457746e-02 -1.36324220e-02  8.86599339e-02\n",
      "  5.42085193e-03 -4.70405229e-02  9.00121081e-02  6.50392728e-02\n",
      "  6.40212196e-02 -4.80338215e-02  5.21236752e-02  2.77770462e-02\n",
      " -1.33471310e-02  2.49327440e-02  1.52714674e-02  4.96134212e-03\n",
      "  2.69372360e-02  4.02838210e-02  3.99137603e-02  1.09036285e-01\n",
      "  6.67874011e-02  8.41275456e-02  8.75869836e-02 -1.99734329e-02\n",
      " -4.10804254e-02  1.20796535e-01  1.03699028e-01  2.50627460e-02\n",
      "  5.49380665e-02  1.33983599e-02  9.03278988e-04  8.61981316e-03\n",
      "  1.03041931e-02  1.76766645e-02  1.58122505e-02  3.10935598e-02\n",
      "  8.68813615e-02  2.62194555e-02  6.90467186e-02  9.47777098e-03\n",
      " -5.32785617e-03 -2.52699551e-02  7.86174433e-02  4.59835993e-02\n",
      "  2.21098931e-02  1.36483605e-02 -1.70523024e-02  1.81253523e-02\n",
      "  2.28482750e-02  3.49091617e-02  1.92502167e-03 -7.10641744e-03\n",
      "  5.47337156e-03 -1.32635623e-02  1.70599363e-02  6.10153323e-02\n",
      " -5.83119090e-03  6.08770396e-02  4.66500788e-03  1.25991448e-01\n",
      "  2.07886646e-02  3.18106750e-02  1.37392308e-01  3.61291354e-04\n",
      "  2.82124080e-03 -1.89556554e-02  2.59797154e-02  1.17412985e-01\n",
      "  1.52368169e-02 -3.92001870e-02  1.76707025e-02  3.49414043e-02\n",
      "  1.10495833e-01 -2.62582010e-02  1.74359183e-02 -2.61435155e-02\n",
      "  2.30332886e-02  5.66529822e-02  5.47466966e-02  2.12466667e-02\n",
      "  7.42131772e-02  5.54873813e-02  2.96636571e-02  2.42996666e-02\n",
      "  1.24574321e-02 -3.66268408e-02  5.10833203e-02 -9.19411359e-02\n",
      "  2.57368798e-02 -1.59822282e-02  4.48797639e-02 -4.43224104e-02\n",
      " -8.83562415e-04  1.04401796e-02  4.80104357e-02  6.02076962e-02\n",
      "  3.17555550e-02  9.05581309e-03 -3.71823458e-02  6.11067941e-03\n",
      "  5.73208893e-02  6.00441813e-02  3.66196699e-02  1.28905943e-02\n",
      " -8.39220271e-02  9.64029814e-02  6.08003681e-02  1.50606849e-02\n",
      "  2.91221862e-02  1.49474074e-02  1.21162046e-01  3.57856940e-02\n",
      "  1.68265865e-02 -3.11500411e-02  1.75313872e-02  4.75842133e-02\n",
      " -3.34244198e-02 -5.20258066e-02  1.90841491e-02 -2.86798848e-02\n",
      "  9.26663717e-03  4.65491052e-02  3.57855077e-02 -3.84353230e-02\n",
      "  1.98151991e-02 -7.98395094e-02  3.81666203e-05  4.44709478e-02\n",
      " -1.08129183e-02 -7.87826489e-03 -1.15665414e-02 -1.11209221e-01\n",
      "  1.54511088e-02  8.73721325e-03  2.00030501e-02 -2.14024533e-02\n",
      "  4.16407553e-03  3.86778367e-02  1.37940881e-02  6.06018659e-02\n",
      "  7.20187719e-03  3.61070048e-02  6.47848635e-02  5.97981717e-02\n",
      "  5.21147942e-02  4.88814393e-02  1.04422476e-02  2.96484035e-02\n",
      "  5.59812706e-02  1.21381675e-01  1.10408279e-03  4.90792463e-03\n",
      " -2.28285873e-02  1.79462628e-02]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import r2_score, mean_squared_error\n",
    "\n",
    "lrRegressor_val_predictions = pipeline.predict(X_val[model_features])\n",
    "print(\"LinearRegression on Validation: Mean_squared_error: %f,  R_square_score: %f\" % \\\n",
    "      (mean_squared_error(y_val, lrRegressor_val_predictions),r2_score(y_val, lrRegressor_val_predictions)))\n",
    "print(\"LinearRegression model weights: \\n\", pipeline.named_steps['lr'].coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.2  Ridge (Linear Regression with L2 regularization)\n",
    "Let's now fit __Ridge__ from the scikit-learn library and check its performance on the validation dataset.\n",
    "\n",
    "Find more details on __Ridge__ here:\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html\n",
    "\n",
    "To improve the performance of a LinearRegression model, __Ridge__ tunes model complexity by adding an $L_2$ penalty on the weights to the model cost function:\n",
    "\n",
    "$$\\text{C}_{\\text{regularized}}(\\textbf{w}) = \\text{C}(\\textbf{w}) + alpha \\cdot ||\\textbf{w}||_2^2$$\n",
    "\n",
    "where $\\textbf{w}$ is the model weights vector, and $||\\textbf{w}||_2^2 = \\sum_i w_i^2$.\n",
    "\n",
    "The strength of the regularization is controlled by the regularization parameter $alpha$: the smaller the value of $alpha$, the weaker the regularization; the larger the value of $alpha$, the stronger the regularization.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ridge on Validation: Mean_squared_error: 0.589863,  R_square_score: 0.357331\n",
      "Ridge model weights: \n",
      " [-1.62659058e+00 -4.04096046e-01  5.70523894e-02 -2.28531118e-02\n",
      "  7.71254696e-02 -1.13718761e-02  6.15333908e-02  8.28863853e-02\n",
      "  4.20569678e-02  1.26957551e-02  4.81132566e-02  1.69045396e-01\n",
      "  4.91422956e-02 -1.46089982e-02 -4.38431715e-02 -4.25339016e-03\n",
      "  3.42315374e-02  7.39254670e-02  1.16588148e-02  1.88175073e-03\n",
      " -2.01580295e-02  3.50815165e-03  1.37455954e-02 -2.47344739e-02\n",
      " -3.29418280e-02 -9.54068234e-03 -1.23129395e-01 -6.14422230e-02\n",
      "  5.23155119e-02 -8.25802438e-03  7.99729823e-03  1.21703514e-02\n",
      " -4.80660397e-02 -5.28818784e-02  5.15501054e-02 -5.81213299e-02\n",
      "  1.71252129e-02 -4.24460056e-02 -1.33928303e-02  3.37290359e-02\n",
      " -1.11737401e-02  9.47427793e-02 -7.60915290e-03 -1.67582619e-02\n",
      " -1.33907632e-02 -3.38346555e-02 -5.10335094e-02 -9.36197533e-02\n",
      "  6.97817233e-03  4.24563094e-02 -1.73695417e-02  7.45381237e-02\n",
      "  4.76347214e-03 -4.80396324e-02  8.47983576e-02  6.29906726e-02\n",
      "  6.52725834e-02 -4.76258501e-02  4.67004612e-02  2.80724503e-02\n",
      " -1.19702419e-02  2.69562266e-02  1.70576119e-02  8.34370796e-03\n",
      "  2.65589941e-02  4.28777121e-02  3.81423942e-02  1.03665695e-01\n",
      "  6.45795791e-02  8.16344429e-02  8.54056560e-02 -1.96452692e-02\n",
      " -3.92154817e-02  1.17382908e-01  1.00207955e-01  2.54672410e-02\n",
      "  5.29488760e-02  9.85790631e-03  1.79967085e-03  1.13636350e-02\n",
      "  1.37749955e-02  1.71365994e-02  1.51686292e-02  3.31342330e-02\n",
      "  8.28024318e-02  2.52904389e-02  6.93378805e-02  9.68778341e-03\n",
      " -3.78439055e-03 -2.35844199e-02  7.95316577e-02  4.23079813e-02\n",
      "  2.39725550e-02  1.18743394e-02 -2.63054522e-03  1.97070830e-02\n",
      "  2.37748430e-02  3.45474802e-02  2.54008856e-03 -6.30928300e-03\n",
      "  6.12319435e-03 -1.01566463e-02  1.83452900e-02  5.96435926e-02\n",
      " -3.03213809e-03  5.97076535e-02  6.50835853e-03  1.21535644e-01\n",
      "  2.25392513e-02  3.32513765e-02  1.31445458e-01 -3.85095510e-03\n",
      "  4.04490673e-04 -1.70284126e-02  2.51633943e-02  1.16361112e-01\n",
      "  1.66040670e-02 -3.59079768e-02  1.73693989e-02  3.15850824e-02\n",
      "  1.09751210e-01 -2.45129840e-02  1.64790768e-02 -2.39056392e-02\n",
      "  2.31421690e-02  5.79940074e-02  5.56586322e-02  2.07039661e-02\n",
      "  7.31487381e-02  5.80457260e-02  2.91350454e-02  2.48693687e-02\n",
      "  1.14228851e-02 -3.49501093e-02  5.16214619e-02 -8.66296021e-02\n",
      "  2.78702080e-02 -1.36879762e-02  4.73342136e-02 -4.23450815e-02\n",
      " -2.42186229e-03  8.67764884e-03  4.55676419e-02  5.82758351e-02\n",
      "  3.28180528e-02  8.89890639e-03 -3.28105296e-02  1.07601135e-02\n",
      "  6.13222148e-02  5.98664362e-02  3.61734115e-02  1.23789555e-02\n",
      " -7.94745453e-02  9.32985179e-02  6.02208708e-02  1.65315560e-02\n",
      "  2.93044321e-02  1.58352749e-02  1.19414728e-01  3.63831124e-02\n",
      "  1.82964328e-02 -2.82301796e-02  1.82729707e-02  4.64089148e-02\n",
      " -3.15347074e-02 -4.57846930e-02  1.89790014e-02 -2.55437751e-02\n",
      "  1.03284658e-02  4.48175900e-02  4.02014403e-02 -3.66890717e-02\n",
      "  2.05616791e-02 -7.40296638e-02  6.75526767e-04  4.85402563e-02\n",
      " -1.13000277e-02 -1.70867782e-03 -9.14970844e-03 -1.07217336e-01\n",
      "  1.38202528e-02  1.07231367e-02  2.23966420e-02 -1.99428501e-02\n",
      "  5.64500054e-03  3.90419940e-02  1.26463788e-02  6.07820757e-02\n",
      "  8.16749970e-03  3.75938881e-02  6.46510189e-02  6.03700217e-02\n",
      "  5.19989817e-02  4.92217264e-02  1.13038650e-02  2.74314177e-02\n",
      "  5.46947438e-02  1.16565629e-01  6.14879678e-04  7.35530324e-03\n",
      " -2.09973387e-02  1.57507891e-02]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.metrics import r2_score, mean_squared_error\n",
    "\n",
    "# Let's update the pipeline with Ridge regression model\n",
    "ridge_pipeline = Pipeline([\n",
    "    ('data_preprocessing', data_preprocessor),\n",
    "    ('ridge', Ridge(alpha = 100))\n",
    "])\n",
    "\n",
    "ridge_pipeline.fit(X_train[model_features], y_train.values)\n",
    "ridgeRegressor_val_predictions = ridge_pipeline.predict(X_val[model_features])\n",
    "\n",
    "print(\"Ridge on Validation: Mean_squared_error: %f,  R_square_score: %f\" % \\\n",
    "      (mean_squared_error(y_val, ridgeRegressor_val_predictions),r2_score(y_val, ridgeRegressor_val_predictions)))\n",
    "\n",
    "print(\"Ridge model weights: \\n\", ridge_pipeline.named_steps['ridge'].coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.3 LASSO (Linear Regression with L1 regularization)\n",
    "Let's also fit __Lasso__ from the scikit-learn library and check its performance on the validation dataset.\n",
    "\n",
    "Find more details on __Lasso__ here:\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html\n",
    "\n",
    "__Lasso__ tunes model complexity by adding an $L_1$ penalty on the weights to the model cost function:\n",
    "\n",
    "$$\\text{C}_{\\text{regularized}}(\\textbf{w}) = \\text{C}(\\textbf{w}) + alpha \\cdot ||\\textbf{w}||_1$$\n",
    "\n",
    "where $\\textbf{w}$ is the model weights vector, and $||\\textbf{w}||_1 = \\sum_i |w_i|$.\n",
    "\n",
    "Again, the strength of the regularization is controlled by the regularization parameter $alpha$. Due to the geometry of the $L_1$ norm, __Lasso__ shrinks some of the weights all the way to 0, leading to sparsity: the corresponding features do not contribute to the model at all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Lasso on Validation: Mean_squared_error: 0.589867,  R_square_score: 0.357327\n",
      "Lasso model weights: \n",
      " [-1.72010524e+00 -3.89067686e-01  2.58213525e-02 -0.00000000e+00\n",
      "  3.82842552e-02 -0.00000000e+00  0.00000000e+00  0.00000000e+00\n",
      "  0.00000000e+00  0.00000000e+00  1.06602376e-02  1.26472385e-01\n",
      "  0.00000000e+00 -0.00000000e+00 -2.50955076e-02  0.00000000e+00\n",
      "  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00\n",
      " -0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00\n",
      " -0.00000000e+00 -0.00000000e+00 -4.63960564e-02 -5.02315491e-02\n",
      "  0.00000000e+00 -0.00000000e+00  0.00000000e+00  0.00000000e+00\n",
      " -0.00000000e+00 -0.00000000e+00  3.47943828e-02 -1.61006740e-02\n",
      "  0.00000000e+00 -7.44345517e-03 -0.00000000e+00  0.00000000e+00\n",
      " -0.00000000e+00  1.95457533e-02 -0.00000000e+00 -0.00000000e+00\n",
      "  0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -0.00000000e+00\n",
      " -0.00000000e+00  0.00000000e+00 -5.88222939e-05  2.42462224e-02\n",
      "  0.00000000e+00 -1.67083722e-02  7.65288954e-02  5.11191702e-02\n",
      "  6.52495708e-02 -1.95841027e-02  4.24640009e-02  1.94354445e-02\n",
      " -0.00000000e+00  2.31689366e-02  6.60345613e-03  0.00000000e+00\n",
      "  7.77241315e-03  4.24802208e-02  2.46611963e-02  9.84988824e-02\n",
      "  5.66360446e-02  7.04583867e-02  7.39390006e-02 -7.30202686e-03\n",
      " -2.01568144e-02  1.09070226e-01  9.85161084e-02  1.89173069e-02\n",
      "  4.21733593e-02  8.92224900e-03  0.00000000e+00  1.71690041e-03\n",
      "  1.12690153e-02  8.37400874e-03  0.00000000e+00  2.08424501e-02\n",
      "  6.58613271e-02  1.11684319e-02  6.50926378e-02  0.00000000e+00\n",
      "  0.00000000e+00 -0.00000000e+00  7.75635511e-02  1.81639732e-02\n",
      "  1.34094779e-02  3.47083919e-03  0.00000000e+00  1.94450618e-02\n",
      "  1.64302217e-02  2.08545783e-02  0.00000000e+00 -0.00000000e+00\n",
      "  4.76631063e-04 -0.00000000e+00  1.11247950e-02  4.88585581e-02\n",
      "  0.00000000e+00  4.50304949e-02  0.00000000e+00  1.18703397e-01\n",
      "  0.00000000e+00  3.08937134e-02  1.23558636e-01  0.00000000e+00\n",
      "  0.00000000e+00 -0.00000000e+00  1.16569677e-02  1.21335603e-01\n",
      "  1.45310173e-02 -5.73789112e-03  0.00000000e+00  1.61575418e-02\n",
      "  1.09974422e-01 -0.00000000e+00  5.62051077e-03 -4.01496300e-03\n",
      "  7.70560592e-03  5.84355530e-02  5.64992818e-02  4.38075213e-03\n",
      "  6.32725981e-02  4.04038476e-02  1.22287629e-02  2.22709234e-02\n",
      "  1.04258533e-02 -1.31483697e-02  4.90841487e-02 -7.81193766e-02\n",
      "  2.65567658e-02 -0.00000000e+00  4.76621958e-02 -7.68826444e-03\n",
      "  0.00000000e+00  0.00000000e+00  2.78708595e-02  4.61145064e-02\n",
      "  4.89654813e-03  7.44021251e-03 -2.00508423e-02  9.89654553e-03\n",
      "  6.13162205e-02  5.69293636e-02  1.80026670e-02  0.00000000e+00\n",
      " -6.68203206e-02  8.53132296e-02  5.07280820e-02  1.31730549e-02\n",
      "  2.11212109e-02  0.00000000e+00  1.17546023e-01  2.52114748e-02\n",
      "  7.19393639e-03 -4.51436030e-05  1.42525486e-02  4.03626273e-02\n",
      " -8.38373569e-03 -6.87175976e-03  4.08641734e-03 -0.00000000e+00\n",
      "  0.00000000e+00  3.92851230e-02  4.45392213e-02 -5.89714208e-05\n",
      "  1.65714427e-02 -3.69897761e-02  0.00000000e+00  5.00988888e-02\n",
      " -0.00000000e+00  0.00000000e+00  0.00000000e+00 -9.13432006e-02\n",
      "  0.00000000e+00  4.31028874e-03  1.26817405e-02 -2.24943717e-03\n",
      "  4.04191830e-03  3.34928099e-02  2.29938781e-03  5.26626213e-02\n",
      "  4.47070643e-03  3.24783959e-02  6.23245733e-02  5.60383137e-02\n",
      "  5.44973737e-02  4.95496624e-02  7.21334713e-03  2.96171512e-02\n",
      "  4.54341406e-02  1.06924613e-01 -0.00000000e+00  0.00000000e+00\n",
      " -5.10833926e-03  1.40971226e-02]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import Lasso\n",
    "from sklearn.metrics import r2_score, mean_squared_error\n",
    "\n",
    "# Let's update the pipeline with Lasso regression model\n",
    "lasso_pipeline = Pipeline([\n",
    "    ('data_preprocessing', data_preprocessor),\n",
    "    ('lasso', Lasso(alpha = 0.001))\n",
    "])\n",
    "\n",
    "lasso_pipeline.fit(X_train[model_features], y_train.values)\n",
    "lassoRegressor_val_predictions = lasso_pipeline.predict(X_val[model_features])\n",
    "\n",
    "print(\"Lasso on Validation: Mean_squared_error: %f,  R_square_score: %f\" % \\\n",
    "      (mean_squared_error(y_val, lassoRegressor_val_predictions),r2_score(y_val, lassoRegressor_val_predictions)))\n",
    "\n",
    "print(\"Lasso model weights: \\n\", lasso_pipeline.named_steps['lasso'].coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.4 ElasticNet (Linear Regression with L2 and L1 regularization)\n",
    "Let's finally try __ElasticNet__ from the scikit-learn library and check its performance on the validation dataset.\n",
    "\n",
    "Find more details on __ElasticNet__ here:\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html\n",
    "\n",
    "__ElasticNet__ tunes model complexity by adding both $L_2$ and $L_1$ penalties on the weights to the model's cost function:\n",
    "\n",
    "$$\\text{C}_{\\text{regularized}}(\\textbf{w}) = \\text{C}(\\textbf{w}) + 0.5 \\cdot alpha \\cdot (1-\\textit{l1}_{ratio}) \\cdot ||\\textbf{w}||_2^2 + alpha \\cdot \\textit{l1}_{ratio} \\cdot ||\\textbf{w}||_1$$\n",
    "\n",
    "Two parameters control the regularization: $alpha$ sets its overall strength, and $\\textit{l1}_{ratio}$ sets the mix between the $L_1$ and $L_2$ penalties."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ElasticNet on Validation: Mean_squared_error: 0.589963,  R_square_score: 0.357222\n",
      "ElasticNet model weights: \n",
      " [-1.68706874e+00 -4.07625447e-01  5.80533276e-02 -1.74342339e-02\n",
      "  7.71012102e-02 -2.79938007e-03  6.08704201e-02  8.27085819e-02\n",
      "  4.01079987e-02  1.22164446e-02  4.90111408e-02  1.81185231e-01\n",
      "  5.72731714e-02 -7.48911322e-03 -4.03361893e-02 -4.30131224e-04\n",
      "  3.32676913e-02  7.24489929e-02  1.04312914e-02  1.44020260e-03\n",
      " -1.70751670e-02  0.00000000e+00  5.69117077e-03 -1.75438933e-02\n",
      " -2.71578678e-02 -7.12566653e-03 -1.26970793e-01 -5.45150508e-02\n",
      "  5.15288749e-02 -8.76079762e-04  7.65163810e-03  9.88670210e-03\n",
      " -4.33266719e-02 -4.74613861e-02  5.27279477e-02 -6.00927840e-02\n",
      "  1.05449510e-02 -4.03282619e-02 -0.00000000e+00  3.02110416e-02\n",
      " -2.58696608e-03  9.72705366e-02 -0.00000000e+00 -1.08439478e-02\n",
      " -8.05392460e-03 -2.82054308e-02 -4.79513248e-02 -8.53745513e-02\n",
      "  0.00000000e+00  4.13017601e-02 -1.32683702e-02  7.47598512e-02\n",
      "  4.29318642e-03 -4.45466901e-02  8.62029311e-02  6.26287563e-02\n",
      "  6.45280064e-02 -4.50540665e-02  4.86399069e-02  2.68667074e-02\n",
      " -1.01189203e-02  2.55678660e-02  1.50411873e-02  5.94743167e-03\n",
      "  2.45497805e-02  4.18197492e-02  3.75863429e-02  1.05350785e-01\n",
      "  6.45827819e-02  8.16920678e-02  8.50396766e-02 -1.92542247e-02\n",
      " -3.84084998e-02  1.18160062e-01  1.01825789e-01  2.45545932e-02\n",
      "  5.25886285e-02  1.15784649e-02  1.22055403e-03  9.17429902e-03\n",
      "  1.19036314e-02  1.64724437e-02  1.35701623e-02  3.11118328e-02\n",
      "  8.30514230e-02  2.43230082e-02  6.87413774e-02  8.40801281e-03\n",
      " -1.89335613e-03 -2.18277179e-02  7.88095533e-02  4.11053130e-02\n",
      "  2.18727044e-02  1.17358428e-02 -5.38738308e-03  1.87305169e-02\n",
      "  2.24699570e-02  3.31072597e-02  1.69661089e-03 -4.62130160e-03\n",
      "  5.04998495e-03 -8.68897055e-03  1.68907133e-02  5.91045780e-02\n",
      " -7.87592543e-04  5.86171776e-02  4.09090917e-03  1.22948260e-01\n",
      "  1.93750677e-02  3.23713804e-02  1.33235522e-01 -0.00000000e+00\n",
      "  1.04567821e-04 -1.52113462e-02  2.39688365e-02  1.17479237e-01\n",
      "  1.60226689e-02 -3.45103379e-02  1.46060847e-02  3.13436287e-02\n",
      "  1.10006644e-01 -2.29315833e-02  1.60635679e-02 -2.33524240e-02\n",
      "  2.13945444e-02  5.73501143e-02  5.54829406e-02  1.90147288e-02\n",
      "  7.28260875e-02  5.60818827e-02  2.76391306e-02  2.42511949e-02\n",
      "  1.15196184e-02 -3.36649539e-02  5.10161063e-02 -8.89813950e-02\n",
      "  2.64829788e-02 -1.25957025e-02  4.62400475e-02 -3.97579749e-02\n",
      " -0.00000000e+00  7.90135047e-03  4.52666170e-02  5.78268289e-02\n",
      "  2.84180954e-02  8.85493041e-03 -3.37988175e-02  8.80560276e-03\n",
      "  5.99676159e-02  5.96939271e-02  3.44039817e-02  1.12371157e-02\n",
      " -8.01106811e-02  9.35775293e-02  5.93942119e-02  1.54218302e-02\n",
      "  2.84324048e-02  1.35352644e-02  1.19857160e-01  3.49082498e-02\n",
      "  1.61413198e-02 -2.68365149e-02  1.74494864e-02  4.60901761e-02\n",
      " -3.02599073e-02 -4.46047674e-02  1.73595678e-02 -2.36046871e-02\n",
      "  7.97216178e-03  4.48052587e-02  3.89272497e-02 -3.40282621e-02\n",
      "  1.97509694e-02 -7.30542845e-02  0.00000000e+00  4.74082231e-02\n",
      " -7.51778123e-03 -2.05099807e-03 -7.61479018e-03 -1.07341257e-01\n",
      "  1.23865483e-02  9.08048512e-03  2.03503967e-02 -1.88313074e-02\n",
      "  4.77047920e-03  3.84286106e-02  1.19681889e-02  5.90752754e-02\n",
      "  7.18757654e-03  3.60911593e-02  6.44898529e-02  5.95977797e-02\n",
      "  5.22282033e-02  4.88821127e-02  1.09551351e-02  2.82955014e-02\n",
      "  5.41002158e-02  1.17644956e-01  0.00000000e+00  4.00313384e-03\n",
      " -2.03658410e-02  1.64129324e-02]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import ElasticNet\n",
    "from sklearn.metrics import r2_score, mean_squared_error\n",
    "\n",
    "# Let's update the pipeline with ElasticNet regression model\n",
    "elastic_net_pipeline = Pipeline([\n",
    "    ('data_preprocessing', data_preprocessor),\n",
    "    ('elastic_net', ElasticNet(alpha = 0.001, l1_ratio = 0.1))\n",
    "])\n",
    "\n",
    "elastic_net_pipeline.fit(X_train[model_features], y_train.values)\n",
    "enRegressor_val_predictions = elastic_net_pipeline.predict(X_val[model_features])\n",
    "\n",
    "print(\"ElasticNet on Validation: Mean_squared_error: %f,  R_square_score: %f\" % \\\n",
    "      (mean_squared_error(y_val, enRegressor_val_predictions),r2_score(y_val, enRegressor_val_predictions)))\n",
    "\n",
    "print(\"ElasticNet model weights: \\n\", elastic_net_pipeline.named_steps['elastic_net'].coef_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7.5 Weight shrinkage and sparsity\n",
    "\n",
    "Let's compare the range of absolute weight values (smallest and largest) for all these regression models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LinearRegression weights range: \n",
      " 3.816662027557287e-05 1.7457730959737865\n",
      "Ridge weights range: \n",
      " 0.00040449067335434973 1.626590581652151\n",
      "Lasso weights range: \n",
      " 0.0 1.7201052442656493\n",
      "ElasticNet weights range: \n",
      " 0.0 1.6870687401639748\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "lin_regression_coeffs = pipeline.named_steps['lr'].coef_\n",
    "ridge_regression_coeffs = ridge_pipeline.named_steps['ridge'].coef_\n",
    "lasso_regression_coeffs = lasso_pipeline.named_steps['lasso'].coef_\n",
    "enet_regression_coeffs = elastic_net_pipeline.named_steps['elastic_net'].coef_\n",
    "\n",
    "print('LinearRegression weights range: \\n', np.abs(lin_regression_coeffs).min(), np.abs(lin_regression_coeffs).max())\n",
    "print('Ridge weights range: \\n', np.abs(ridge_regression_coeffs).min(), np.abs(ridge_regression_coeffs).max())\n",
    "print('Lasso weights range: \\n', np.abs(lasso_regression_coeffs).min(), np.abs(lasso_regression_coeffs).max())\n",
    "print('ElasticNet weights range: \\n', np.abs(enet_regression_coeffs).min(), np.abs(enet_regression_coeffs).max())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The largest absolute weight of each regularized model is smaller than that of __LinearRegression__, and some of the weights of __Lasso__ and __ElasticNet__ are shrunk all the way to 0. Through this sparsity, __Lasso__ regularization reduces the number of features the model uses, effectively performing feature selection."
   ]
  },
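  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sparsity effect can also be checked by counting the weights that are exactly zero. The following is a minimal sketch on synthetic stand-in data, not on the review dataset; the names `X_demo`, `y_demo`, `ridge_demo` and `lasso_demo`, as well as the `alpha` values, are assumptions made only for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.linear_model import Lasso, Ridge\n",
    "\n",
    "# Synthetic stand-in data (assumption): only 5 of the 50 features carry signal\n",
    "X_demo, y_demo = make_regression(n_samples=200, n_features=50, n_informative=5,\n",
    "                                 noise=10.0, random_state=0)\n",
    "\n",
    "# Illustrative alpha values, not the ones used in this notebook\n",
    "ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)\n",
    "lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)\n",
    "\n",
    "# Count the coefficients that are exactly zero in each model\n",
    "n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))\n",
    "n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))\n",
    "\n",
    "print('Ridge zero weights:', n_zero_ridge, 'of', ridge_demo.coef_.size)\n",
    "print('Lasso zero weights:', n_zero_lasso, 'of', lasso_demo.coef_.size)\n",
    "```\n",
    "\n",
    "In this notebook, the same count could be applied to `lasso_regression_coeffs` and `enet_regression_coeffs` from the cell above."
   ]
  },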
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. <a name=\"8\">Ideas for improvement</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "One way to improve the performance of a linear regression model is to try different strengths of regularization, here controlled by the parameters $alpha$ and $\\textit{l1}_{ratio}$."
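    ,
    "\n",
    "\n",
    "A simple way to try several regularization strengths is to loop over candidate values and keep the one with the lowest validation error. The sketch below assumes synthetic stand-in data and a hypothetical list of candidate `alphas`; the names `X_tr`, `X_va`, `y_tr` and `y_va` are illustrative and are not the notebook's own splits.\n",
    "\n",
    "```python\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Synthetic stand-in data (assumption): swap in the notebook's features and target\n",
    "X, y = make_regression(n_samples=300, n_features=40, n_informative=8,\n",
    "                       noise=15.0, random_state=0)\n",
    "X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)\n",
    "\n",
    "# Hypothetical candidate regularization strengths\n",
    "alphas = [0.01, 0.1, 1.0, 10.0, 100.0]\n",
    "\n",
    "# Fit one Ridge model per alpha and record its validation MSE\n",
    "val_mse = {}\n",
    "for alpha in alphas:\n",
    "    model = Ridge(alpha=alpha).fit(X_tr, y_tr)\n",
    "    val_mse[alpha] = mean_squared_error(y_va, model.predict(X_va))\n",
    "\n",
    "# Keep the alpha with the lowest validation error\n",
    "best_alpha = min(val_mse, key=val_mse.get)\n",
    "print(val_mse)\n",
    "print('Best alpha:', best_alpha)\n",
    "```\n",
    "\n",
    "The same loop could wrap `ridge_pipeline`, `lasso_pipeline` or `elastic_net_pipeline` with `X_train[model_features]` and `X_val[model_features]`; for __ElasticNet__, a second loop over $\\textit{l1}_{ratio}$ would be needed."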
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "sagemaker-distribution:Python",
   "language": "python",
   "name": "conda-env-sagemaker-distribution-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
